How to use this book



For students 



Prerequisites for understanding the content in this book are a solid background in probability 
theory and linear algebra. If you are new to information theory, then there is enough 
background in this book to get you up to speed (Chapters 2, 10, 12, and 13), though
classics on information theory such as Cover and Thomas [57] and MacKay [189] could be
helpful as a reference. If you are new to quantum mechanics, then there should be enough 
material in this book (Part II) to give you the background necessary for understanding 
quantum Shannon theory. The book of Nielsen and Chuang (sometimes known as "Mike 
and Ike") has become the standard starting point for students in quantum information 
science and might be helpful as well [197]. Some of the content in that book is available in
Nielsen's dissertation [194]. If you are familiar with Shannon's information theory (at the
level of Cover and Thomas [57], for example), then this book should be a helpful entry point 
into the field of quantum Shannon theory. We build on intuition developed classically to 
help in establishing schemes for communication over quantum channels. If you are familiar 
with quantum mechanics, it might still be worthwhile to review Part II because some content 
there might not be part of a standard course on quantum mechanics. 

The aim of this book is to develop "from the ground up" many of the major, exciting, pre-
and post-millennium developments in the general area of study known as quantum Shannon
theory. As such, we spend a significant amount of time on quantum mechanics for quan- 
tum information theory (Part II), we give a careful study of the important unit protocols 
of teleportation, super-dense coding, and entanglement distribution (Part III), and we de- 
velop many of the tools necessary for understanding information transmission or compression 
(Part IV). Parts V and VI are the culmination of this book, where all of the tools developed 
come into play for understanding many of the important results in quantum Shannon theory. 

For instructors 

This book could be useful for self-learning or as a reference, but one of the main goals is for 
it to be employed as an instructional aid for the classroom. To aid instructors in designing a 
course to suit their own needs, this book is available under a Creative Commons Attribution- 
NonCommercial-ShareAlike license. This means that you can modify and redistribute the 
book as you wish, as long as you attribute the author of this book, you do not use it for
commercial purposes, and you share a modification or derivative work under the same license
(see http://creativecommons.org/licenses/by-nc-sa/3.0/ for a readable summary of
the terms of the license). These requirements can be waived if you obtain permission from the 
present author. By releasing the book under this license, I expect and encourage instructors 
to modify this book for their own needs. This will allow for the addition of new exercises, 
new developments in the theory, and the latest open problems. It might also be a helpful 
starting point for a book on a related topic, such as network quantum Shannon theory. 

I used an earlier version of this book in a one-semester course on quantum Shannon 
theory at McGill University during Winter semester 2011 (in many parts of the US, this 
semester is typically called "Spring semester"). We went through almost the entire book,
but it might also be possible to spread the content over two semesters instead. Here is the 
order in which we proceeded: 

1. Introduction in Part I 

2. Quantum mechanics in Part II 

3. Unit protocols in Part III

4. Chapter 9 on distance measures, Chapter 10 on classical information and entropy, and
Chapter 11 on quantum information and entropy.

5. The first part of Chapter 13 on classical typicality and Shannon compression.

6. The first part of Chapter 14 on quantum typicality.

7. Chapter 17 on Schumacher compression.

8. Back to Chapters 13 and 14 for the method of types.

9. Chapter 18 on entanglement concentration.

10. Chapter 19 on classical communication.

11. Chapter 20 on entanglement-assisted classical communication.

12. The final explosion of results in Chapter 21 (one of which is a route to proving the
achievability part of the quantum capacity theorem).

The above order is just a particular order that suited the needs for the class at McGill, 
but other orders are of course possible. One could sacrifice the last part of Part III on the 
unit resource capacity region if there is no desire to cover the quantum dynamic capacity 
theorem. One could also focus on going from classical communication to private classical 
communication to quantum communication in order to develop some more intuition behind 
the quantum capacity theorem. 
Other sources

There are many other sources from which to obtain a background in quantum Shannon theory. The
standard reference has become the book of Nielsen and Chuang [197], but it does not feature
any of the post-millennium results in quantum Shannon theory. Other books that cover some
aspects of quantum Shannon theory are Hayashi's [128] and Holevo's [145]. Patrick Hayden
has had a significant hand as a collaborative guide for many PhD and Master's theses in
quantum Shannon theory, during his time as a postdoctoral fellow at the California Institute
of Technology and as a professor at McGill University. These include the theses of Yard [266],
Abeyesinghe [2], Savov [210, 211], Dupuis [82], and Dutil [85]. All of these theses are excellent
references. Naturally, Hayden also had a strong influence over the present author during the 
development of this book. 
CHAPTER 1

Concepts in Quantum Shannon Theory



In these first few chapters, our aim is to establish a firm grounding so that we can address 
some fundamental questions regarding information transmission over quantum channels. 
This area of study has become known as "quantum Shannon theory" in the broader quan- 
tum information community, in order to distinguish this topic from other areas of study in 
quantum information science. In this text, we will use the terms "quantum Shannon the- 
ory" and "quantum information theory" somewhat interchangeably. We will begin by briefly 
overviewing several fundamental aspects of the quantum theory. Our study of the quantum 
theory, in this chapter and future ones, will be at an abstract level, without giving preference 
to any particular physical system such as a spin-1/2 particle or a photon. This approach
will be more beneficial for the purposes of our study, but, here and there, we will make some 
reference to actual physical systems to ground us in reality. 

You may be wondering, what is quantum Shannon theory and why do we name this 
area of study as such? In short, quantum Shannon theory is the study of the ultimate 
capability of noisy physical systems, governed by the laws of quantum mechanics, to preserve 
information and correlations. Quantum information theorists have chosen the name quantum 
Shannon theory to honor Claude Shannon, who single-handedly founded the field of classical 
information theory, with a groundbreaking 1948 paper [222]. In particular, the name refers
to the asymptotic theory of quantum information, which is the main topic of study in this
book. Information theorists since Shannon have dubbed him the "Einstein of the information
age."¹ The name quantum Shannon theory is fit to capture this area of study because we
use quantum versions of Shannon's ideas to prove some of the main theorems in quantum 
Shannon theory. 

¹ It is worthwhile to look up "Claude Shannon - Father of the Information Age" on YouTube and watch
several renowned information theorists speak with awe about "the founding father" of information theory.

We prefer the title "quantum Shannon theory" over such titles as "quantum information
science" or just "quantum information." These other titles are too broad, encompassing
subjects as diverse as quantum computation, quantum algorithms, quantum complexity
theory, quantum communication complexity, entanglement theory, quantum key distribution,
quantum error correction, and even the experimental implementation of quantum protocols. 
Quantum Shannon theory does overlap with some of the aforementioned subjects, such as 
quantum computation, entanglement theory, quantum key distribution, and quantum error 
correction, but the name "quantum Shannon theory" should evoke a certain paradigm for 
quantum communication with which the reader will become intimately familiar after some 
exposure to the topics in this book. For example, it is necessary for us to discuss quantum 
gates (a topic in quantum computing) because quantum Shannon-theoretic protocols exploit
them to achieve certain information processing tasks. Also, in Chapter 22, we are interested
in the ultimate limitation on the ability of a noisy quantum communication channel to trans- 
mit private information (information that is secret from any third party besides the intended 
receiver). This topic connects quantum Shannon theory with quantum key distribution be- 
cause the private information capacity of a noisy quantum channel is strongly related to the 
task of using the quantum channel to distribute a secret key. As a final connection, perhaps 
the most important theorem of quantum Shannon theory is the quantum capacity theorem. 
This theorem determines the ultimate rate at which a quantum channel can reliably transmit 
quantum information to a receiver. The result provided by the quantum capacity theorem is 
closely related to the theory of quantum error correction, but the mathematical techniques 
used in quantum Shannon theory and in quantum error correction are so different that these 
subjects merit different courses of study. 

Quantum Shannon theory intersects two of the great sciences of the twentieth century: the 
quantum theory and information theory. It was really only a matter of time before physicists, 
mathematicians, computer scientists, and engineers began to consider the convergence of the 
two subjects because the quantum theory was essentially established by 1926 and information 
theory by 1948. This convergence has sparked what we may call the "quantum information 
revolution" or what some refer to as the "second quantum revolution" [81] (with the first 
one being the discovery of the quantum theory).

The fundamental components of the quantum theory are a set of postulates that govern 
phenomena on the scale of atoms. Uncertainty is at the heart of the quantum theory — 
"quantum uncertainty" or "Heisenberg uncertainty" is not due to our lack or loss of informa- 
tion or due to imprecise measurement capability, but rather, it is a fundamental uncertainty 
inherent in nature itself. The discovery of the quantum theory came about as a total shock 
to the physics community, shaking the foundations of scientific knowledge. Perhaps it is for 
this reason that every introductory quantum mechanics course delves into its history in detail 
and celebrates the founding fathers of the quantum theory. In this book, we do not discuss 
the history of the quantum theory in much detail and instead refer to several great introductory
books for these details [95, 114, 209, 116]. Physicists such as Planck, Einstein, Bohr,
de Broglie, Born, Heisenberg, Schrödinger, Pauli, Dirac, and von Neumann contributed to
the foundations of the quantum theory in the 1920s and 1930s. We introduce the quantum 
theory by briefly commenting on its history and major underlying concepts. 

©2012 Mark M. Wilde. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0
Unported License.

Information theory is the second great foundational science for quantum Shannon theory.
In some sense, it is merely an application of probability theory. Its aim is to quantify the
ultimate compressibility of information and the ultimate ability for a sender to transmit
information reliably to a receiver. It relies upon probability theory because a "classical" 
uncertainty, arising from our lack of total information about any given scenario, is ubiquitous 
throughout all information processing tasks. The uncertainty in classical information theory 
is the kind that is present in the flipping of a coin or the shuffle of a deck of cards, the 
uncertainty due to imprecise knowledge. "Quantum" uncertainty is inherent in nature itself 
and is perhaps not as intuitive as the uncertainty that classical information theory measures. 
We later expand further on these differing kinds of uncertainty, and Chapter 4 shows how a
theory of quantum information captures both kinds of uncertainty within one formalism.²

The history of classical information theory began with Claude Shannon. Shannon's 
contribution is heralded as one of the single greatest contributions to modern science because 
he established the entire field in his seminal 1948 paper [222]. In this paper, he coined the
essential terminology, and he stated and justified the main mathematical definitions and 
the two fundamental theorems of information theory. Many successors have contributed to 
information theory, but most, if not all, of the follow-up contributions employ Shannon's 
line of thinking in some form. In quantum Shannon theory, we will notice that many of 
Shannon's original ideas are present, though they take a particular "quantum" form. 

One of the major assumptions in both classical information theory and quantum Shannon 
theory is that local computation is free but communication is expensive. In particular, for 
the classical case, we assume that each party has unbounded computation available. For the 
quantum case, we assume that each party has a fault-tolerant quantum computer available 
at his or her local station and the power of each quantum computer is unbounded. We also 
assume that both communication and a shared resource are expensive, and for this reason, 
we keep track of these resources in a resource count. Sometimes, though, we might say that
classical communication is free in order to simplify a scenario. A simplification like this one
can lead to greater insights that would not be possible without making such an assumption. 

We should first study and understand the postulates of the quantum theory in order to 
study quantum Shannon theory properly. Your heart may sink when you learn that the 
Nobel Prize-winning physicist Richard Feynman is famously quoted as saying, "I think I
can safely say that nobody understands quantum mechanics." We should clarify Feynman's
statement. Of course, Feynman does not intend to suggest that no one knows how to work
with the quantum theory. Many able physicists are employed to spend their days
exploiting the laws of the quantum theory to do fantastic things, such as the trapping of 
ions in a vacuum or applying the quantum tunneling effect in a transistor to process a single 
electron. I am hoping that you will give me the license to interpret Feynman's statement. 
I think he means that it is impossible for us to understand the quantum theory intuitively 
because we do not experience the phenomena that it predicts. If we were the size of atoms 
and we experienced the laws of quantum theory on a daily basis, then perhaps the quantum 
theory would be as intuitive to us as Newton's law of universal gravitation.³ Thus, in this
sense, I would agree with Feynman: nobody can really understand the quantum theory
because it is not part of our everyday experiences. Nevertheless, our aim in this book is
to work with the laws of quantum theory so that we may begin to gather insights about
what the theory predicts. Only by exposure to and practice with its postulates can we really
gain an intuition for its predictions. It is best to imagine that the world in our everyday
life does incorporate the postulates of quantum mechanics, because, indeed, as many, many
experiments have confirmed, it does!

² Von Neumann established the density matrix formalism in his 1932 book on the quantum theory. This
mathematical framework captures both kinds of uncertainty [243].

³ Of course, Newton's law of universal gravitation was a revolutionary breakthrough because the
phenomenon of gravity is not entirely intuitive when a student first learns it. But, we do experience the
gravitational law in our daily lives and I would argue that this phenomenon is much more intuitive than,
say, the phenomenon of quantum entanglement.

We delve into the history of the convergence of the quantum theory and information
theory in some detail in this introductory chapter because this convergence does have an
interesting history and is relevant to the topic of this book. The purpose of this historical
review is not only to become familiar with the field itself but also to glimpse into the minds
of the founders of the field so that we may see the types of questions that are important to
think about when tackling new, unsolved problems.⁴ Many of the most important results
come about from asking simple, yet profound, questions and exploring the possibilities.

⁴ Another way to discover good questions is to attend parties that well-established professors hold. The
story goes that Oxford physicist David Deutsch attended a 1981 party at the Austin, Texas house of
renowned physicist John Archibald Wheeler, in which many attendees discussed the foundations of computing
[193]. Deutsch claims that he could immediately see that the quantum theory would give an improvement
for computation. Not quite immediately afterward, in 1985, he published an algorithm that was the first
instance of a quantum speedup over the fastest classical algorithm [67].

We first briefly review the history and the fundamental concepts of the quantum theory
before delving into the convergence of the quantum theory and information theory. We build
on these discussions by introducing some of the initial fundamental contributions to quantum
Shannon theory. The final part of this chapter ends by posing some of the questions to which
quantum Shannon theory provides answers.



1.1 Overview of the Quantum Theory

1.1.1 Brief History of the Quantum Theory

A physicist living around 1890 would have been well pleased with the progress of physics,
but perhaps frustrated at the seeming lack of open research problems. It seemed as though
the Newtonian laws of mechanics, Maxwell's theory of electromagnetism, and Boltzmann's
theory of statistical mechanics explained most natural phenomena. In fact, Max Planck, one
of the founding fathers of the quantum theory, was searching for an area of study in 1874
and his advisor gave him the following guidance:

"In this field [of physics], almost everything is already discovered, and all that
remains is to fill a few holes."

Two Clouds



Fortunately for the quantum theory, Planck did not heed this advice and instead began his 
physics studies. Not everyone agreed with Planck's former advisor. Lord Kelvin stated in 
his famous April 1900 lecture that "two clouds" surrounded the "beauty and clearness of 
theory" [1]. The first cloud was the failure of Michelson and Morley to detect a change in the
speed of light as predicted by an "ether theory," and the second cloud was the ultraviolet 
catastrophe, the prediction of classical theory that a blackbody emits radiation with an 
infinite intensity at high ultraviolet frequencies. In that same year of 1900, Planck started 
the quantum revolution that began to clear the second cloud. He assumed that light comes 
in discrete bundles of energy and used this idea to produce a formula that correctly predicts 
the spectrum of blackbody radiation [205]. A great cartoon lampoon of the ultraviolet
catastrophe shows Planck calmly sitting fireside with a classical physicist whose face is
burning to bits because of the intense ultraviolet radiation that his classical theory predicts
the fire is emitting [190]. A few years later, in 1905, Einstein contributed a paper that
helped to further clear the second cloud [86] (he also cleared the first cloud with his other 
1905 paper on special relativity). He assumed that Planck was right and showed that the 
postulate that light arrives in "quanta" (now known as the photon theory) provides a simple 
explanation for the photoelectric effect, the phenomenon in which electromagnetic radiation 
beyond a certain threshold frequency impinging on a metallic surface induces a current in 
that metal. 

These two explanations of Planck and Einstein fueled a theoretical revolution in physics 
that some now call the first quantum revolution [81]. Some years later, in 1924, Louis de
Broglie postulated that every individual element of matter, whether an atom, electron, or
photon, has both particle-like behavior and wave-like behavior [66]. Just two years later,
Erwin Schrödinger used the de Broglie idea to formulate a wave equation, now known as
Schrödinger's equation, that governs the evolution of a closed quantum-mechanical system
[214]. His formalism later became known as wave mechanics and was popular among
physicists because it appealed to notions with which they were already familiar. Meanwhile, in
1925, Werner Heisenberg formulated an "alternate" quantum theory called matrix mechanics
[137]. His theory used matrices and theorems from linear algebra, mathematics with which
physicists at the time were not readily familiar. For this reason, Schrödinger's wave
mechanics was more popular than Heisenberg's matrix mechanics. In 1930, Paul Dirac published
a textbook (now in its fourth edition and reprinted 16 times) that unified the formalisms
of Schrödinger and Heisenberg, showing that they were actually equivalent [79]. In a later
edition, he introduced the now ubiquitous "Dirac notation" for quantum theory that we will 
employ in this book. 

After the publication of Dirac's textbook, the quantum theory then stood on firm math- 
ematical grounding and the basic theory had been established. We thus end our historical 
overview at this point and move on to the fundamental concepts of the quantum theory. 
1.1.2 Fundamental Concepts of the Quantum Theory

Quantum theory, as applied in quantum information theory, really has only a few important 
concepts. We review each of these aspects of quantum theory briefly in this section. Some 
of these phenomena are uniquely "quantum" but others do occur in the classical theory. In 
short, these concepts are as follows:⁵

1. Indeterminism 

2. Interference 

3. Uncertainty 

4. Superposition 

5. Entanglement 

The quantum theory is indeterministic because the theory makes predictions about prob- 
abilities of events only. This aspect of quantum theory is in contrast with a deterministic 
classical theory such as that predicted by the Newtonian laws. In the Newtonian system, it 
is possible to predict, with certainty, the trajectories of all objects involved in an interaction 
if one knows only the initial positions and velocities of all the objects. This deterministic 
view of reality even led some to believe in determinism from a philosophical point of view. 
For instance, the mathematician Pierre-Simon Laplace once stated that a supreme intellect, 
colloquially known as Laplace's demon, could predict all future events from present and past 
events: 

"We may regard the present state of the universe as the effect of its past and 
the cause of its future. An intellect which at a certain moment would know all 
forces that set nature in motion, and all positions of all items of which nature is 
composed, if this intellect were also vast enough to submit these data to analysis, 
it would embrace in a single formula the movements of the greatest bodies of the 
universe and those of the tiniest atom; for such an intellect nothing would be 
uncertain and the future just like the past would be present before its eyes." 

The application of Laplace's statement to atoms is fundamentally incorrect, but we can 
forgive him because the quantum theory had not yet been established in his time. Many 
have extrapolated from Laplace's statement to argue the invalidity of human free will. We 
leave such debates to philosophers.⁶

⁵ I have used Todd A. Brun's list from his lecture notes [49].

⁶ John Archibald Wheeler may disagree with this approach. He once said, "Philosophy is too important
to be left to the philosophers" [191].

In reality, we never can possess full information about the positions and velocities of
every object in any given physical system. Incorporating probability theory then allows
us to make predictions about the probabilities of events and, with some modifications, the
classical theory becomes an indeterministic theory. Thus, indeterminism is not a unique
aspect of the quantum theory but merely a feature of it. But this feature is so crucial to the 
quantum theory that we list it among the fundamental concepts. 

Interference is another feature of the quantum theory. It is also present in any classical 
wave theory — constructive interference occurs when the crest of one wave meets the crest of 
another, producing a stronger wave, while destructive interference occurs when the crest of 
one wave meets the trough of another, canceling each other out. In any classical wave theory,
a wave occurs as a result of many particles in a particular medium coherently displacing one 
another, as in an ocean surface wave or a sound pressure wave, or as a result of coherent 
oscillating electric and magnetic fields, as in an electromagnetic wave. The strange aspect 
of interference in the quantum theory is that even a single "particle" such as an electron 
can exhibit wave-like features, as in the famous double slit experiment (see Ref. [115] for a
history of these experiments). This quantum interference is what contributes wave-particle 
duality to every fundamental component of matter. 
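
The difference between adding amplitudes and adding probabilities can be made concrete with a small numerical sketch. It is only a toy model under our own assumptions: two paths of equal amplitude 1/2 (a normalization chosen so that the peak probability is 1), a single relative phase, and hypothetical function names that do not come from the text.

```python
import cmath
import math


def detection_probability(phase_difference):
    """Probability at one screen point when two equal-amplitude paths combine.

    Each path carries a complex amplitude of magnitude 1/2 (an illustrative
    normalization, not a model of a real double-slit apparatus).
    """
    amp_path_1 = 0.5
    amp_path_2 = 0.5 * cmath.exp(1j * phase_difference)
    # Amplitudes are added first; the probability is the squared magnitude.
    return abs(amp_path_1 + amp_path_2) ** 2


def no_interference_probability():
    """Result of adding the two path probabilities instead of the amplitudes."""
    return abs(0.5) ** 2 + abs(0.5) ** 2


constructive = detection_probability(0.0)      # crest meets crest
destructive = detection_probability(math.pi)   # crest meets trough
print(constructive, destructive, no_interference_probability())
```

With zero relative phase the two amplitudes reinforce each other, with a relative phase of pi they cancel, and the phase-independent sum of probabilities sits halfway between the two.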

Uncertainty is at the heart of the quantum theory. Uncertainty in the quantum theory 
is fundamentally different from uncertainty in the classical theory (discussed in the earlier
paragraph about an indeterministic classical theory). The archetypal example of uncertainty
in the quantum theory occurs for a single particle. This particle has two complementary vari- 
ables: its position and its momentum. The uncertainty principle states that it is impossible 
to know both its position and momentum precisely. This principle even calls into question 
the meaning of the word "know" in the previous sentence in the context of quantum the- 
ory. We might say that we can only know that which we measure, and thus, we can only 
know the position of a particle after performing a precise measurement that determines it. 
If we follow with a precise measurement of its momentum, we lose all information about the 
position of the particle after learning its momentum. In quantum information science, the 
BB84 protocol for quantum key distribution exploits the uncertainty principle and statistical 
analysis to determine the presence of an eavesdropper on a quantum communication channel 
by encoding information into two complementary variables.
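
A discrete analog may help to fix ideas. The sketch below replaces position and momentum with two complementary bases of a single qubit (this substitution is our assumption, made for illustration) and applies the Born rule: a state that has a definite value in one basis gives completely random outcomes in the other. The names `ket0`, `ket_plus`, and `outcome_probabilities` are hypothetical.

```python
import math

# Two complementary qubit bases, used here as a stand-in for position and
# momentum: the computational (Z) basis and the Hadamard (X) basis.
ket0 = (1.0, 0.0)
ket1 = (0.0, 1.0)
s = 1 / math.sqrt(2)
ket_plus = (s, s)
ket_minus = (s, -s)

z_basis = [ket0, ket1]
x_basis = [ket_plus, ket_minus]


def outcome_probabilities(state, basis):
    """Born-rule probabilities |<b|state>|^2 for each basis vector b."""
    probabilities = []
    for b in basis:
        overlap = sum(complex(x).conjugate() * y for x, y in zip(b, state))
        probabilities.append(abs(overlap) ** 2)
    return probabilities


# A state with a definite value in the Z basis ...
print(outcome_probabilities(ket0, z_basis))
# ... is maximally uncertain in the complementary X basis.
print(outcome_probabilities(ket0, x_basis))
```

The same holds in the other direction: a state that is definite in the X basis gives equally likely outcomes in the Z basis.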



The superposition principle states that a quantum particle can be in a linear combination 
state, or superposed state, of any two other allowable states. This principle is a result of the 
linearity of quantum theory. Schrödinger's wave equation is a linear differential equation,
meaning that the linear combination $\alpha\psi + \beta\phi$ is a solution of the equation if $\psi$ and $\phi$ are both
solutions of the equation. We say that the solution $\alpha\psi + \beta\phi$ is a coherent superposition of the
two solutions. The superposition principle has dramatic consequences for the interpretation 
of the quantum theory — it gives rise to the notion that a particle can somehow "be in one 
location and another" at the same time. There are different interpretations of the meaning 
of the superposition principle, but we do not highlight them here. We merely choose to 
use the technical language that the particle is in a superposition of both locations. The 
loss of a superposition can occur through the interaction of a particle with its environment. 
Maintaining an arbitrary superposition of quantum states is one of the central goals of a 
quantum communication protocol. 
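
The linearity claim can be checked numerically in a minimal setting. The sketch below assumes a two-level system with Hamiltonian equal to the Pauli matrix sigma_x and units in which hbar = 1 (both choices are ours, made only so that the solution has a closed form). It confirms that evolving a superposition gives the same result as superposing the evolved states.

```python
import math


def evolve(state, t):
    """Closed-form solution of i d|psi>/dt = H|psi> for H = sigma_x, hbar = 1.

    The propagator is U(t) = exp(-i t sigma_x) = cos(t) I - i sin(t) sigma_x.
    """
    c, s = math.cos(t), math.sin(t)
    a, b = state
    return (c * a - 1j * s * b, -1j * s * a + c * b)


def combine(alpha, psi, beta, phi):
    """Return the linear combination alpha * psi + beta * phi."""
    return tuple(alpha * x + beta * y for x, y in zip(psi, phi))


psi0 = (1.0, 0.0)
phi0 = (0.0, 1.0)
alpha, beta = 0.6, 0.8j  # chosen so that |alpha|^2 + |beta|^2 = 1
t = 0.7

evolved_superposition = evolve(combine(alpha, psi0, beta, phi0), t)
superposed_evolutions = combine(alpha, evolve(psi0, t), beta, evolve(phi0, t))
print(evolved_superposition)
print(superposed_evolutions)
```

The two printed vectors agree up to floating-point error, and the evolved state remains normalized, as expected for the evolution of a closed system.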
The last, and perhaps most striking, "quantum" feature that we highlight here is entanglement.
There is no true classical analog of entanglement. The closest analog of entanglement
might be a secret key that two parties possess, but even this analogy does not come close.
Entanglement refers to the strong quantum correlations that two or more quantum
particles can possess. The correlations in quantum entanglement are stronger than any classical
correlations in a precise, technical sense. Schrödinger first coined the term "entanglement"
after observing some of its strange properties and consequences [215]. Einstein, Podolsky,
and Rosen then presented an apparent paradox involving entanglement that raised concerns 
over the completeness of the quantum theory [87]. That is, they suggested that the seem- 
ingly strange properties of entanglement called the uncertainty principle into question (and 
thus the completeness of the quantum theory) and furthermore suggested that there might 
be some "local hidden-variable" theory that could explain the results of experiments. It 
took about thirty years to resolve this paradox, but John Bell did so by presenting a simple 
inequality, now known as a Bell inequality [17]. He showed that any two-particle classical 
correlations that satisfy the assumptions of the "local hidden-variable theory" of Einstein, 
Podolsky, and Rosen must be less than a certain amount. He then showed how the correla- 
tions of two entangled quantum particles can violate this inequality, and thus, entanglement 
has no explanation in terms of classical correlations but is instead a uniquely quantum phe- 
nomenon. Experimentalists later verified that two entangled quantum particles can violate 
Bell's inequality [10].
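
One way to see such a violation numerically is through the CHSH form of a Bell inequality. The text above does not single out this form, so it is our choice for illustration, as are the maximally entangled state, the measurement angles, and all function names below. For local hidden-variable models it suffices to check deterministic assignments of the values +1 and -1, because any such model is a probabilistic mixture of them.

```python
import itertools
import math


def observable(theta):
    """Measurement along angle theta in the X-Z plane: cos(theta) Z + sin(theta) X."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [s, -c]]


def kron(a, b):
    """Kronecker product of two 2x2 matrices, returned as a 4x4 nested list."""
    return [[a[i][j] * b[k][l] for j in range(2) for l in range(2)]
            for i in range(2) for k in range(2)]


def expectation(state, matrix):
    """<state| matrix |state> for a real state vector and a real matrix."""
    n = len(state)
    return sum(state[r] * matrix[r][c] * state[c]
               for r in range(n) for c in range(n))


def chsh_quantum():
    """CHSH value for the state (|00> + |11>)/sqrt(2) with illustrative angles."""
    phi_plus = [1 / math.sqrt(2), 0.0, 0.0, 1 / math.sqrt(2)]
    a0, a1 = 0.0, math.pi / 2            # first party's two settings
    b0, b1 = math.pi / 4, -math.pi / 4   # second party's two settings

    def corr(a, b):
        return expectation(phi_plus, kron(observable(a), observable(b)))

    return corr(a0, b0) + corr(a0, b1) + corr(a1, b0) - corr(a1, b1)


def chsh_classical_max():
    """Largest CHSH value over all deterministic +1/-1 assignments."""
    return max(a0 * b0 + a0 * b1 + a1 * b0 - a1 * b1
               for a0, a1, b0, b1 in itertools.product([-1, 1], repeat=4))


print(chsh_classical_max(), chsh_quantum())
```

The classical maximum is 2, while the entangled state with these settings reaches 2 times the square root of 2, which exceeds the bound.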

In quantum information science, the non-classical correlations in entanglement play a 
fundamental role in many protocols. For example, entanglement is the enabling resource in 
teleportation, a protocol that disembodies a quantum state in one location and reproduces 
it in another. We will see many other examples of exploiting entanglement throughout this 
book. 

Entanglement theory concerns different methods for quantifying the amount of entangle- 
ment present not only in a two-particle state but also in a multiparticle state. A large body 
of literature exists that investigates entanglement theory [155], but we only address aspects
of entanglement that are relevant in our study of quantum Shannon theory. 

The above five features capture the essence of the quantum theory, but we will see more 
aspects of it as we progress through our overview in Chapters 3, 4, and 5.



1.2 The Emergence of Quantum Shannon Theory 



In the previous section, we discussed several unique quantum phenomena such as superpo- 
sition and entanglement, but it is not clear what kind of information these unique quantum 
phenomena represent. Is it possible to find a convergence of the quantum theory and Shannon's
information theory, and if so, what is the convergence?

1.2.1 The Shannon Information Bit

A fundamental contribution of Shannon is the notion of a bit as a measure of information. 
Typically, when we think of a bit, we think of a two-valued quantity that can be in the state
'off' or the state 'on.' We represent this bit with a binary number that can be '0' or '1.' 
We also associate a physical representation with a bit — this physical representation can be 
whether a light switch is off or on, whether a transistor allows current to flow or does not, 
whether a large number of magnetic spins point in one direction or another, the list going 
on and on. These are all physical notions of a bit. 

Shannon's notion of a bit is quite different from these physical notions, and we motivate 
his notion with the example of a fair coin. Without flipping the coin, we have no idea what 
the result of a coin flip will be — our best guess at the result is to guess randomly. If someone 
else learns the result of a random coin flip, we can ask this person the question: What was 
the result? We then learn one bit of information. 

Though it may seem obvious, it is important to stress that we learn less information, or
none at all, if we do not ask the right question. This point becomes even more
important in the quantum case. Suppose that the coin is not fair — without loss of generality, 
suppose the probability of "heads" is greater than the probability of "tails." In this case, we 
would not be as surprised to learn that the result of a coin flip is "heads." We may say in 
this case that we learn less than one bit of information if we were to ask someone the result 
of the coin flip. 

The Shannon binary entropy is a measure of information. Given a probability distribution 
(p, 1 - p) for a binary random variable, its Shannon binary entropy is

H_2(p) = -p \log p - (1 - p) \log(1 - p),    (1.1)

where the logarithm is base two. The Shannon binary entropy measures information in units
of bits. We will discuss it in more detail in the next chapter and in Chapter 10.
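
As a quick numerical check of (1.1), the following minimal Python sketch evaluates the binary entropy for a few values of p. The function name binary_entropy and the convention 0 log 0 = 0 are our own choices for illustration and do not come from the text.

```python
import math


def binary_entropy(p: float) -> float:
    """Shannon binary entropy H_2(p) in bits, with the convention 0 log 0 = 0."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if p in (0.0, 1.0):
        # A certain outcome carries no surprise.
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


# A fair coin gives one full bit; a biased coin gives less.
for p in (0.5, 0.75, 0.9, 1.0):
    print(f"H_2({p}) = {binary_entropy(p):.4f} bits")
```

The fair coin attains the maximum of one bit, and the entropy decreases toward zero as the coin becomes more biased, in line with the discussion of the unfair coin above.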

The Shannon bit, or Shannon binary entropy, is a measure of the surprise upon learning 
the outcome of a random binary experiment. Thus, the Shannon bit has a completely 
different interpretation from that of the physical bit. The outcome of the coin flip resides in 
a physical bit, but it is the information associated with the random nature of the physical 
bit that we would like to measure. It is this notion of bit that is important in information 
theory. 

1.2.2 A Measure of Quantum Information 

The above section discusses Shannon's notion of a bit as a measure of information. A natural 
question is whether there is an analogous measure of quantum information, but before we 
can even ask that question, we might first wonder: what is quantum information? As in the 
classical case, there is a physical notion of quantum information. A quantum state always 
resides "in" a physical system. Perhaps another way of stating this idea is that every physical 
system is in some quantum state. The physical notion of a quantum bit, or qubit for short 
(pronounced "cue · bit"), is a two-level quantum system. Examples of two-level quantum

©2012 Mark M. Wilde — This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 
Unported License 



systems are the spin of the electron, the polarization of a photon, or an atom with a ground 
state and an excited state. The physical notion of a qubit is straightforward to understand 
once we have a grasp of the quantum theory. 

A more pressing question for us in this book is to understand an informational notion 
of a qubit, as in the Shannon sense. In the classical case, we quantify information by the 
amount of knowledge we gain after learning the answer to a probabilistic question. In the 
quantum world, what knowledge can we have of a quantum state? 

Sometimes we may know the exact quantum state of a physical system because we prepared
the quantum system in a certain way. For example, we may prepare an electron in
its "spin-up in the z direction" state, where |↑_z⟩ denotes this state. If we prepare the state
in this way, we know for certain that the state is indeed |↑_z⟩ and no other state. Thus, we
do not gain any information, or equivalently, there is no removal of uncertainty if someone
else tells us that the state is |↑_z⟩. We may say that this state has zero qubits of quantum
information, where the term "qubit" now refers to a measure of the quantum information of
a state.

In the quantum world, we also have the option of measuring this state in the x direction.
The postulates of quantum theory, given in Chapter 3, predict that the state will then be
|↑_x⟩ or |↓_x⟩ with equal probability after measuring in the x direction. One interpretation
of this aspect of quantum theory is that the system does not have any definite state in
the x direction if we know that it has a definite z direction; in fact, there is maximal
uncertainty about its x direction. This behavior is one manifestation of
the Heisenberg uncertainty principle. So before performing the measurement, we have no
knowledge of the resulting state, and we gain one Shannon bit of information after learning
the result of the measurement. If we use Shannon's notion of entropy and perform an x
measurement, this classical measure loses some of its capability to capture our knowledge
of the state of the system. It is inadequate to capture our knowledge of the state because
we actually prepared it ourselves and know with certainty that it is in the state |↑_z⟩. With
these different notions of information gain, which one is the most appropriate for the quantum
case?

It turns out that the first way of thinking is the one that is most useful for quantifying 
quantum information. If someone tells us the quantum state of a particular physical system 
and this state is indeed the true state, then we have complete knowledge of the state and 
thus do not learn more "qubits" of quantum information from this point onward. This line of 
thinking is perhaps similar in one sense to the classical world, but different from the classical 
world, in the sense of the case presented in the previous paragraph. 

Now suppose that a friend, let us call him "Bob," randomly prepares quantum states as a
probabilistic ensemble. Suppose Bob prepares |↑_z⟩ or |↓_z⟩ with equal probability. With only
this probabilistic knowledge, we acquire one bit of information if Bob reveals which state
he prepared. We could also perform a quantum measurement on the system to determine
what state Bob prepared (we discuss quantum measurements in detail in Chapter 3). One
reasonable measurement to perform is a measurement in the z direction. The result of
the measurement determines which state Bob actually prepared because both states in the
ensemble are states with definite z direction. The result of this measurement thus gives
us one bit of information, the same amount that we would learn if Bob informed us which
state he prepared. It seems that most of this logic is similar to the classical case; i.e., the
result of the measurement only gave us one Shannon bit of information.

Another measurement to perform is a measurement in the x direction. If the actual
state prepared is |↑_z⟩, then the quantum theory predicts that the state becomes |↑_x⟩ or |↓_x⟩
with equal probability. Similarly, if the actual state prepared is |↓_z⟩, then the quantum
theory predicts that the state again becomes |↑_x⟩ or |↓_x⟩ with equal probability. Calculating
probabilities, the resulting state is |↑_x⟩ with probability 1/2 and |↓_x⟩ with probability 1/2.
So the Shannon bit content of learning the result is again one bit, but we arrived at this
conclusion in a much different fashion from the scenario where we measured in the z direction.
How can we quantify the quantum information of this ensemble? We claim for now that
this ensemble contains one qubit of quantum information and this result derives from either
the measurement in the z direction or the measurement in the x direction for this particular
ensemble.

Let us consider one final example that perhaps gives more insight into how we might
quantify quantum information. Suppose Bob prepares |↑_z⟩ or |↑_x⟩ with equal probability.
The first state is spin-up in the z direction and the second is spin-up in the x direction. If Bob
reveals which state he prepared, then we learn one Shannon bit of information. But suppose
now that we would like to learn the prepared state on our own, without the help of our friend
Bob. One possibility is to perform a measurement in the z direction. If the state prepared is
|↑_z⟩, which happens with probability 1/2, then the measurement returns |↑_z⟩ with
certainty. But if the state prepared is |↑_x⟩, then the
quantum theory predicts that the state becomes |↑_z⟩ or |↓_z⟩ with equal probability (while we
learn what the new state is). Thus, quantum theory predicts that the act of measuring this
ensemble inevitably disturbs the state some of the time. Also, there is no way that we can
learn with certainty whether the prepared state is |↑_z⟩ or |↑_x⟩. Using a measurement in the
z direction, the resulting state is |↑_z⟩ with probability 3/4 and |↓_z⟩ with probability 1/4. We
learn less than one Shannon bit of information from this ensemble because the probability
distribution becomes skewed when we perform this particular measurement.

The probabilities resulting from the measurement in the z direction are the same that
would result from an ensemble where Bob prepares |↑_z⟩ with probability 3/4 and |↓_z⟩ with
probability 1/4 and we perform a measurement in the z direction. The actual Shannon
entropy of the distribution (3/4, 1/4) is about 0.81 bits, confirming our intuition that we
learn less than one bit. A similar, symmetric analysis holds to show that we
gain 0.81 bits of information when we perform a measurement in the x direction.
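
The numbers in this example can be reproduced with a short calculation. The sketch below assumes the usual representation of the spin states as real amplitude pairs in the z basis, with spin-up in the x direction equal to the equal superposition of spin-up and spin-down in the z direction, and it applies the Born rule (an outcome's probability is the squared overlap with the prepared state). The function names are illustrative only.

```python
import math

S = 1 / math.sqrt(2)

# Spin states as real amplitude pairs in the z basis (an assumed convention).
UP_Z, DOWN_Z = (1.0, 0.0), (0.0, 1.0)
UP_X, DOWN_X = (S, S), (S, -S)


def overlap_sq(a, b):
    """Born-rule probability |<a|b>|^2 for real two-component states."""
    return (a[0] * b[0] + a[1] * b[1]) ** 2


def outcome_distribution(ensemble, basis):
    """Probability of each measurement outcome, averaged over the ensemble."""
    return [sum(p * overlap_sq(outcome, state) for p, state in ensemble)
            for outcome in basis]


def shannon_entropy(dist):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(q * math.log2(q) for q in dist if q > 0)


# Bob prepares spin-up-z or spin-up-x, each with probability 1/2.
ensemble = [(0.5, UP_Z), (0.5, UP_X)]

z_dist = outcome_distribution(ensemble, [UP_Z, DOWN_Z])
x_dist = outcome_distribution(ensemble, [UP_X, DOWN_X])
print("z measurement:", z_dist, shannon_entropy(z_dist))
print("x measurement:", x_dist, shannon_entropy(x_dist))
```

Both measurements yield the skewed distribution (3/4, 1/4), which is why the z and x analyses give the same entropy.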

We have more knowledge of the system in question if we gain less information from 
performing measurements on it. In the quantum theory, we learn less about a system if we 
perform a measurement on it that does not disturb it too much. Is there a measurement that 
we can perform in which we learn the least amount of information? Recall that learning the 
least amount of information is ideal because it has the interpretation that we require fewer 
questions on average to learn the result of a random experiment. Indeed, it turns out that a 
measurement in the x + z direction reveals the least amount of information. Avoiding details 
for now, this measurement returns a state that we label |↑_{x+z}⟩ with probability cos^2(π/8)
and a state |↓_{x+z}⟩ with probability sin^2(π/8). This measurement has the desirable effect
that it causes the least amount of disturbance to the original states in the ensemble. The
entropy of the distribution resulting from the measurement is about 0.6 bits and is less than
the one bit that we learn if Bob reveals the state. The entropy 0.6 is also the least amount of
information among all possible sharp measurements that we may perform on the ensemble.
We claim that this ensemble contains 0.6 qubits of quantum information.
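
A brute-force scan makes the claim about the x + z direction plausible. The sketch below is only a check under simplifying assumptions: it restricts attention to sharp measurements along directions in the x-z plane, parametrized by an angle theta so that the "up" outcome is (cos theta, sin theta) in the z basis, and it searches a finite grid of angles. With this parametrization, theta = 0 is the z measurement, theta = π/4 is the x measurement, and theta = π/8 is the x + z measurement. The function name is hypothetical.

```python
import math


def entropy_of_xz_measurement(theta):
    """Entropy (bits) of the outcome distribution when the ensemble
    {spin-up-z, spin-up-x}, each prepared with probability 1/2, is measured
    in the real basis {(cos t, sin t), (-sin t, cos t)}."""
    up = (math.cos(theta), math.sin(theta))
    s = 1 / math.sqrt(2)
    states = [(1.0, 0.0), (s, s)]  # spin-up-z and spin-up-x in the z basis
    # Born rule, averaged over the two equally likely preparations.
    p_up = sum(0.5 * (up[0] * st[0] + up[1] * st[1]) ** 2 for st in states)
    return -sum(q * math.log2(q) for q in (p_up, 1.0 - p_up) if q > 0)


# Scan measurement directions between the z axis (theta = 0) and theta = pi/2.
angles = [k * math.pi / 1800 for k in range(900)]
best = min(angles, key=entropy_of_xz_measurement)

print("best angle / pi:", best / math.pi)
print("minimum entropy:", entropy_of_xz_measurement(best))
print("cos^2(pi/8):", math.cos(math.pi / 8) ** 2)
```

Within this restricted family, the minimum sits at theta = π/8 with an entropy of about 0.6 bits, while the z and x measurements both give about 0.81 bits.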

We can determine the ultimate compressibility of classical data with Shannon's source 
coding theorem (we overview this technique in the next chapter). Is there a similar way that 
we can determine the ultimate compressibility of quantum information? This question was 
one of the early and profitable ones for quantum Shannon theory and the answer is affir- 
mative. The technique for quantum compression is called Schumacher compression, named 
after Benjamin Schumacher. Schumacher used ideas similar to those of Shannon: he created
the notion of a quantum information source that emits random physical qubits, and he invoked
the law of large numbers to show that there is a so-called typical subspace where most
of the quantum information really resides. This line of thought is similar to that which we
will discuss in the overview of data compression in the next chapter. The size of the typical
subspace for most quantum information sources is exponentially smaller than the size of the
space in which the emitted physical qubits reside. Thus, one can "quantum compress" the
quantum information to this subspace without losing much. Schumacher's quantum source 
coding theorem then quantifies, in an operational sense, the amount of actual quantum infor- 
mation that the ensemble contains. The amount of actual quantum information corresponds 
to the number of qubits, in the informational sense, that the ensemble contains. It is this 
measure that is equivalent to the "optimal measurement" one that we suggested in the previ- 
ous paragraph. We will study this idea in more detail later when we introduce the quantum 
theory and a rigorous notion of a quantum information source. 
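
The law-of-large-numbers argument can be previewed with its classical counterpart, the typical set. The sketch below is an illustration under assumptions of our own: it models the source as n independent bits with the skewed distribution (cos^2(π/8), sin^2(π/8)) from the example above, calls a sequence delta-typical when its sample entropy is within delta of the true entropy, and uses a hypothetical function name typical_set. It is not Schumacher's construction itself, only the counting argument that the construction builds on.

```python
import math


def typical_set(n, q, delta):
    """Return (log2 of the size, total probability) of the delta-typical set
    for n independent bits with P(0) = q."""
    h = -q * math.log2(q) - (1 - q) * math.log2(1 - q)
    size, weight = 0, 0.0
    for k in range(n + 1):  # k = number of 1s in the sequence
        # log2-probability of any single sequence containing k ones
        log_prob = (n - k) * math.log2(q) + k * math.log2(1 - q)
        if abs(-log_prob / n - h) <= delta:
            size += math.comb(n, k)  # exact integer count of such sequences
            # Accumulate probability in the log domain to avoid overflow.
            log2_count = (math.lgamma(n + 1) - math.lgamma(k + 1)
                          - math.lgamma(n - k + 1)) / math.log(2)
            weight += 2.0 ** (log2_count + log_prob)
    if size == 0:
        return float("-inf"), 0.0
    return math.log2(size), weight


q = math.cos(math.pi / 8) ** 2
for n in (100, 1000, 10000):
    log2_size, weight = typical_set(n, q, delta=0.05)
    print(f"n={n}: bits per symbol = {log2_size / n:.3f}, weight = {weight:.4f}")
```

As n grows, the typical set captures almost all of the probability while occupying a fraction of roughly 2^(-0.35 n) of the 2^n possible sequences, which is the sense in which the space needed is exponentially smaller.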

The techniques of quantum Shannon theory are the direct quantum analog of the tech- 
niques from classical information theory. We use the law of large numbers and the notion 
of the typical subspace, but we require generalizations of measures from the classical world 
to determine how "close" two different quantum states are. One measure, the fidelity, has 
the operational interpretation that it gives the probability that one quantum state would 
pass a test for being another. The trace distance is another distance measure that is per- 
haps more similar to a classical distance measure — its classical analog is a measure of the 
closeness of two probability distributions. The techniques in quantum Shannon theory also 
reside firmly in the quantum theory and have no true classical analog for some cases. Some 
of the techniques will seem similar to those in the classical world, but the answers to some
of the fundamental questions in quantum Shannon theory are far different from some of the
answers in the classical world. It is the purpose of this book to explore the answers to the 
fundamental questions of quantum Shannon theory, and we now begin to ask what kinds of 
tasks we can perform. 
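
For pure states, both closeness measures mentioned above have simple closed forms, sketched below under stated assumptions. The fidelity between pure states is taken to be the squared magnitude of their inner product. For the trace distance we assume the normalization with a factor of 1/2, under which the pure-state value is sqrt(1 - F); conventions for this factor differ between references, so treat it as an assumption rather than as this book's definition. The function names are ours.

```python
import math


def fidelity(psi, phi):
    """|<psi|phi>|^2 for pure states given as lists of amplitudes."""
    inner = sum(complex(a).conjugate() * complex(b) for a, b in zip(psi, phi))
    return abs(inner) ** 2


def trace_distance_pure(psi, phi):
    """Trace distance between pure states, normalized with a factor 1/2
    (an assumed convention), using the closed form sqrt(1 - F)."""
    return math.sqrt(max(0.0, 1.0 - fidelity(psi, phi)))


s = 1 / math.sqrt(2)
up_z, down_z, up_x = [1, 0], [0, 1], [s, s]

for name, other in (("up_z", up_z), ("up_x", up_x), ("down_z", down_z)):
    print(name, fidelity(up_z, other), trace_distance_pure(up_z, other))
```

Identical states have fidelity one and distance zero, orthogonal states have fidelity zero and distance one, and the non-orthogonal pair from the earlier ensemble example lies in between.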

1.2.3 Operational Tasks in Quantum Shannon Theory 

Quantum Shannon theory has several resources that two parties can exploit in a quantum 
information processing task. Perhaps the most natural quantum resource is a noiseless qubit 
channel. We can think of this resource as some medium through which a physical qubit can 
travel without being affected by any noise. One example of a noiseless qubit channel could 
be the free space through which a photon travels, where it ideally does not interact with any 
other particles along the way to its destination.^7

A noiseless classical bit channel is a special case of a noiseless qubit channel because we 
can always encode classical information into quantum states. For the example of a photon, we 
can say that horizontal polarization corresponds to a '0' and vertical polarization corresponds 
to a '1'. We refer to the dynamic resource of a noiseless classical bit channel as a cbit, in 
order to distinguish it from the noiseless qubit channel. 

Perhaps the most intriguing resource that two parties can share is noiseless entanglement. 
Any entanglement resource is a static resource because it is one that they share. Examples 
of static resources in the classical world are an information source that we would like to 
compress or a common secret key that two parties may possess. We actually have a way 
of measuring entanglement that we discuss later on, and for this reason, we can say that a 
sender and receiver have bits of entanglement or ebits. 

Entanglement turns out to be a useful resource in many quantum communication tasks. 
One example where it is useful is in the teleportation protocol, where a sender and receiver 
use one ebit and two classical bits to transmit one qubit faithfully. This protocol is an 
example of the extraordinary power of noiseless entanglement. The name "teleportation" 
is really appropriate for this protocol because the physical qubit vanishes from the sender's 
station and appears at the receiver's station after the receiver obtains the two transmitted 
classical bits. We will see later on that a noiseless qubit channel can generate the other 
two noiseless resources, but it is impossible for each of the other two noiseless resources to 
generate the noiseless qubit channel. In this sense, the noiseless qubit channel is the strongest 
of the three unit resources. 
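
The resource count in teleportation (one ebit and two classical bits for one qubit) can be traced in a small state-vector simulation. The sketch below is a minimal illustration under assumptions that the text has not yet introduced: the ebit is taken to be the state (|00⟩ + |11⟩)/sqrt(2), the sender measures in the four-element Bell basis, and the two classical bits (z, x) tell the receiver whether to apply the Pauli Z and X corrections. The names BELL and teleport are hypothetical.

```python
import math

S = 1 / math.sqrt(2)

# Bell states on the sender's two qubits (A', A), keyed by the two classical
# bits (z, x) that the sender transmits after obtaining that outcome.
BELL = {
    (0, 0): {(0, 0): S, (1, 1): S},
    (1, 0): {(0, 0): S, (1, 1): -S},
    (0, 1): {(0, 1): S, (1, 0): S},
    (1, 1): {(0, 1): S, (1, 0): -S},
}


def teleport(psi):
    """For each Bell-measurement outcome, return (probability, receiver's
    state after applying the corrections indicated by the classical bits)."""
    # Joint state of (A', A, B): psi on A' and the ebit on (A, B).
    joint = {(ap, a, b): psi[ap] * (S if a == b else 0.0)
             for ap in (0, 1) for a in (0, 1) for b in (0, 1)}
    results = {}
    for (z, x), bell in BELL.items():
        # Project (A', A) onto the Bell state; what is left is the
        # receiver's unnormalized state on B (Bell amplitudes are real).
        bob = [sum(amp * joint[(ap, a, b)] for (ap, a), amp in bell.items())
               for b in (0, 1)]
        prob = sum(abs(c) ** 2 for c in bob)
        bob = [c / math.sqrt(prob) for c in bob]
        if x:  # Pauli X correction: swap the two amplitudes
            bob = [bob[1], bob[0]]
        if z:  # Pauli Z correction: flip the sign of the |1> amplitude
            bob = [bob[0], -bob[1]]
        results[(z, x)] = (prob, bob)
    return results


psi = [0.6, 0.8j]  # an arbitrary normalized qubit state
for bits, (prob, bob) in teleport(psi).items():
    print(bits, round(prob, 3), bob)
```

Each of the four outcomes occurs with the same probability regardless of the input state, so the two classical bits alone reveal nothing about the qubit, yet the receiver recovers the state exactly once the corrections are applied.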

The first quantum information processing task that we have discussed is Schumacher 
compression. The goal of this task is to use as few noiseless qubit channels as possible in 
order to transmit the output of a quantum information source reliably. After we understand 
Schumacher compression in a technical sense, the main focus of this book is to determine 
what quantum information processing tasks a sender and receiver can accomplish with the 
use of a noisy quantum channel. The first and perhaps simplest task is to determine how 
much classical information a sender can transmit reliably to a receiver, by using a noisy quan- 
tum channel a large number of times. This task is known as HSW coding, named after its 
discoverers Holevo, Schumacher, and Westmoreland. The HSW coding theorem is one quan- 
tum generalization of Shannon's channel coding theorem (overviewed in the next chapter). 
We can also assume that a sender and receiver share some amount of noiseless entanglement 
prior to communication. They can then use this noiseless entanglement in addition to a large 



7 We should be careful to note here that this is not actually a perfect channel because even empty space
can be noisy in quantum mechanics, but nevertheless, it is a simple physical example to imagine.
number of uses of a noisy quantum channel. This task is known as entanglement-assisted
classical communication over a noisy quantum channel. The capacity theorem corresponding
to this task again highlights one of the marvelous features of entanglement. It shows that
entanglement gives a boost to the amount of noiseless classical communication we can generate
with a noisy quantum channel: the classical capacity is generally higher with entanglement
assistance than without it.

Perhaps the most important theorem for quantum Shannon theory is the quantum chan- 
nel capacity theorem. Any proof of a capacity theorem consists of two parts: one part 
establishes a lower bound on the capacity and the other part establishes an upper bound. If 
the two bounds coincide, then we have a characterization of the capacity in terms of these 
bounds. The lower bound on the quantum capacity is colloquially known as the LSD coding 
theorem,^8 and it gives a characterization of the highest rate at which a sender can transmit
quantum information reliably over a noisy quantum channel so that a receiver can recover it 
perfectly. The rate is generally lower than the classical capacity because it is more difficult 
to keep quantum information intact. As we have said before, it is possible to encode clas- 
sical information into quantum states, but this classical encoding is only a special case of a 
quantum state. In order to preserve quantum information, we have to be able to preserve 
arbitrary quantum states, not merely a classical encoding within a quantum state. 



The pinnacle of this book is in Chapter 23, where we finally reach our study of the quantum
capacity theorem. All efforts and technical developments in preceding chapters have this
goal in mind.^9 Our first coding theorem in the dynamic setting is the HSW coding theorem.
A rigorous study of this coding theorem lays an important foundation — an understanding 
of a coding structure for a noisy quantum channel. The method for the HSW coding theo- 
rem applies to the "entanglement-assisted classical capacity theorem," which is one building 
block for other protocols in quantum Shannon theory. We then build a more complex coding 
structure for sending private classical information over a noisy quantum channel. In private 
coding, we are concerned with coding in such a way that the intended receiver can learn the 
transmitted message perfectly, but a third party eavesdropper cannot learn anything about 
what the sender transmits to the intended receiver. This study of the private classical capac- 
ity may seem like a detour at first, but it is closely linked with our ultimate aim. The coding 
structure developed for sending private information proves to be indispensable for under- 
standing the structure of a quantum code. There are strong connections between the goals 
of keeping classical information private and keeping quantum information coherent. In the 
private coding scenario, the goal is to avoid leaking any information to an eavesdropper so 
that she cannot learn anything about the transmission. In the quantum coding scenario, we 
can think of quantum noise as resulting from the environment learning about the transmit- 
ted quantum information and this act of learning disturbs the quantum information. This 



8 The LSD coding theorem does not refer to the synthetic crystalline compound, lysergic acid diethylamide
(which one may potentially use as a hallucinogenic drug), but refers rather to Seth Lloyd [185], Peter Shor
[227], and Igor Devetak [68], all of whom gave separate proofs of the lower bound on the quantum capacity
with increasing standards of rigor.

9 One goal of this book is to unravel the mathematical machinery behind Devetak's proof of the quantum
channel coding theorem [68].
effect is related to the information-disturbance trade-off that is fundamental in quantum
information theory. If the environment learns something about the state being transmitted, 
there is inevitably some sort of noisy disturbance that affects the quantum state. Thus, we 
can see a correspondence between private coding and quantum coding. In quantum coding, 
the goal is to avoid leaking any information to the environment because the avoidance of 
such a leak implies that there is no disturbance to the transmitted state. So the role of the 
environment in quantum coding is similar to the role of the eavesdropper in private coding, 
and the goal in both scenarios is to decouple either the environment or eavesdropper from 
the picture. It is then no coincidence that private codes and quantum codes have a similar 
structure. In fact, we can say that the quantum code inherits its structure from that of the
private code.^10



We also consider "trade-off" problems in addition to discussing the quantum capacity
theorem. Chapter 21 is another high point of the book, featuring a whole host of results
that emerge by combining several of the ideas from previous chapters. The most appealing
aspect of this chapter is that we can construct virtually all of the protocols in quantum
Shannon theory from just one idea in Chapter 20. Also, Chapter 21 answers many practical
questions concerning information transmission over noisy quantum channels. Some example
questions are as follows:

• How much quantum and classical information can a noisy quantum channel transmit? 

• An entanglement-assisted noisy quantum channel can transmit more classical informa- 
tion than an unassisted one, but how much entanglement is really necessary? 

• Does noiseless classical communication help in transmitting quantum information re- 
liably over a noisy quantum channel? 

• How much entanglement can a noisy quantum channel generate when aided by classical 
communication? 

• How much quantum information can a noisy quantum channel communicate when 
aided by entanglement? 

These are examples of trade-off problems because they involve a noisy quantum channel 
and either the consumption or generation of a noiseless resource. For every combination 
of the generation or consumption of a noiseless resource, there is a corresponding coding 
theorem that states what rates are achievable (and in some cases optimal). Some of these 
trade-off questions admit interesting answers, but some of them do not. Our final aim in 
these trade-off questions is to determine the full triple trade-off solution where we study 
the optimal ways of combining all three unit resources (classical communication, quantum 
communication, and entanglement) with a noisy quantum channel. 



10 There are other methods of formulating quantum codes using random subspaces [227, 132, 134, 174],
but we prefer the approach of Devetak because we learn about other aspects of quantum Shannon theory, 
such as the private capacity, along the way to proving the quantum capacity theorem. 
The coding theorems for a noisy quantum channel are just as important as (if not more
important than) Shannon's classical coding theorems because they determine the ultimate
capabilities of information processing in a world where the postulates of quantum theory apply.
It is thought that quantum theory is the ultimate theory underpinning all physical phenom- 
ena and any theory of gravity will have to incorporate the quantum theory in some fashion. 
Thus, it is reasonable that we should be focusing our efforts now on a full Shannon theory of 
quantum information processing in order to determine the tasks that these systems can ac- 
complish. In many physical situations, some of the assumptions of quantum Shannon theory 
may not be justified (such as an independent and identically distributed quantum channel), 
but nevertheless, it provides an ideal setting in which we can determine the capabilities of 
these physical systems. 

1.2.4 History of Quantum Shannon Theory 

We conclude this introductory chapter by giving a brief overview of the problems that re- 
searchers were thinking about that ultimately led to the development of quantum Shannon 
theory. 

The 1970s — The first researchers in quantum information theory were concerned with 
transmitting classical data by optical means. They were ultimately led to a quantum formu- 
lation because they wanted to transmit classical information by means of a coherent laser. 
Coherent states are special quantum states that a coherent laser ideally emits. Glauber provided
a full quantum-mechanical theory of coherent states in two seminal papers [108, 109],
for which he shared the Nobel Prize in 2005 [110]. The first researchers of quantum information
theory were Helstrom, Gordon, Stratonovich, and Holevo. Gordon first conjectured
an important bound for our ability to access classical information from a quantum system
[111] and Levitin stated it without proof [182]. Holevo later provided a proof that the bound
holds [143, 142]. This important bound is now known as the Holevo bound, and it is useful
in proving converse theorems (theorems concerning optimality) in quantum Shannon theory.
The simplest version of the Holevo bound states that it is not possible to transmit more
than one classical bit of information using a noiseless qubit channel; i.e., we get one cbit "per
qubit." Helstrom developed a full theory of quantum detection and quantum estimation and
published a textbook that discusses this theory [139]. Fannes contributed a continuity
property of the entropy that is also useful in proving converse theorems in quantum Shannon
theory [91]. Fannes was not studying quantum information theory proper, but nevertheless,
his contribution, now known as Fannes' inequality, is an important one for the field.
Wiesner also used the uncertainty principle to devise a notion of "quantum money" in 1970,
but unfortunately, his work was not accepted upon its initial submission. This work was
way ahead of its time, and it was not until much later that it was accepted [247]. Wiesner's
ideas paved the way for the BB84 protocol for quantum key distribution.

The 1980s — The 1980s witnessed only a few advances in quantum information theory 
because just a handful of researchers thought about the possibilities of linking quantum 
theory with information-theoretic ideas. The Nobel Prize-winning physicist Richard Feynman
published an interesting 1982 article that was one of the first to discuss computing with
quantum-mechanical systems [HI]. His interest was in using a quantum computer to simulate
quantum-mechanical systems; he figured there should be a speed-up over a classical simulation
if we instead use a quantum system to simulate another. This work is less quantum
Shannon theory than it is quantum computing, but it is still a landmark because Feynman 
began to think about exploiting the actual quantum information in a physical system, rather 
than just using quantum systems to process classical information as the researchers in the 
1970s suggested. 

Wootters and Zurek produced one of the simplest, yet most profound, results that is 
crucial to quantum information science [265] (Dieks also proved this result in the same
year [78]). They proved the no-cloning theorem, showing that the postulates of the quantum
theory imply the impossibility of universally cloning quantum states. Given an arbitrary 
unknown quantum state, it is impossible to build a device that can copy this state. This result 
has deep implications for the processing of quantum information and shows a strong divide 
between information processing in the quantum world and that in the classical world. We will 
prove this theorem in Chapter [3] and use it time and again in our reasoning. The history of the 
no-cloning theorem is one of the more interesting "sociology of science" stories that you may 
come across. The story goes that Nick Herbert submitted a paper to Foundations of Physics 
with a proposal for faster-than-light communication using entanglement. Asher Peres was 
the referee [203], and he knew that something had to be wrong with the proposal because it
allowed for superluminal communication, yet he could not put his finger on what the problem 
might be (he also figured that Herbert knew his proposal was flawed). Nevertheless, Peres 
recommended the paper for publication [140] because he figured it would stimulate wide
interest in the topic. Not much later, Wootters and Zurek published their paper, and since 
then, there have been thousands of follow-up results on the no-cloning theorem [213].
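
The obstruction behind the no-cloning theorem can be seen in a small calculation. The sketch below is an illustration and not the proof: it takes one particular candidate "copier," a CNOT gate acting on the input qubit and a blank qubit in the state |0⟩ (our choice of example), and compares its output with two genuine copies of the input. The function names are hypothetical.

```python
import math


def cnot_copy(psi):
    """Apply a CNOT (control: input qubit, target: blank |0>) to psi ⊗ |0>.
    Returns amplitudes ordered as |00>, |01>, |10>, |11>."""
    a, b = psi
    # Before the gate: a|00> + b|10>. The CNOT flips the target when the
    # control is 1, giving a|00> + b|11>.
    return [a, 0.0, 0.0, b]


def two_copies(psi):
    """Amplitudes of psi ⊗ psi, the output an ideal cloner would produce."""
    a, b = psi
    return [a * a, a * b, b * a, b * b]


def clone_fidelity(psi):
    """Squared overlap between the CNOT output and two true copies of psi."""
    inner = sum(complex(t).conjugate() * complex(o)
                for t, o in zip(two_copies(psi), cnot_copy(psi)))
    return abs(inner) ** 2


s = 1 / math.sqrt(2)
for name, psi in (("|0>", [1.0, 0.0]), ("|1>", [0.0, 1.0]), ("|+>", [s, s])):
    print(name, clone_fidelity(psi))
```

The gate copies the two basis states perfectly, but by linearity it maps their equal superposition to an entangled state rather than to two copies, so the overlap drops to one half. The theorem shows that this failure is not special to the CNOT: no single device can copy every unknown state.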

The work of Wiesner on conjugate coding inspired an IBM physicist named Charles 
Bennett. In 1984, Bennett and Brassard published a groundbreaking paper that detailed 
the first quantum communication protocol: the BB84 protocol [22]. This protocol shows how 
a sender and a receiver can exploit a quantum channel to establish a secret key. The security 
of this protocol relies on the uncertainty principle. If any eavesdropper tries to learn about 
the random quantum data that they use to establish the secret key, this act of learning 
inevitably disturbs the transmitted quantum data and the two parties can discover this 
disturbance by noticing the change in the statistics of random sample data. The secret key 
generation capacity of a noisy quantum channel is inextricably linked to the BB84 protocol, 
and we study this capacity problem in detail when we study the ability of quantum channels 
to communicate private information. Interestingly, the physics community largely ignored 
the BB84 paper when Bennett and Brassard first published it, likely because they presented 
it at an engineering conference and the merging of physics and information had not yet taken 
effect. 

The 1990s — The 1990s were a time of much increased activity in quantum information 
science, perhaps some of the most exciting years with many seminal results. One of the 
first major results was from Ekert. He published a different way for performing quantum 
key distribution, this time relying on the strong correlations of entanglement [88]. He was
unaware of the BB84 protocol when he was working on his entanglement-based quantum
key distribution. The physics community embraced this result and shortly thereafter, Ekert
and Bennett and Brassard became aware of each other's respective works [24]. Bennett,
Brassard, and Mermin later showed a sense in which these two seemingly different schemes
are equivalent [25]. Bennett later developed the B92 protocol for quantum key distribution
using any two non-orthogonal quantum states [18].

Two of the most profound results that later impacted quantum Shannon theory appeared 
in the early 1990s. First, Bennett and Wiesner devised the super-dense coding protocol [35].
This protocol consumes one noiseless ebit of entanglement and one noiseless qubit channel 
to simulate two noiseless classical bit channels. Let us compare this result to that of Holevo. 
Holevo's bound states that we can only send one classical bit per qubit, but the super-dense 
coding protocol states that we can double this rate if we consume entanglement as well. 
Thus, entanglement is the enabler in this protocol that boosts the classical rate beyond 
that possible with a noiseless qubit channel alone. The next year, Bennett and some other 
coauthors reversed the operations in the super-dense coding protocol to devise a protocol 
that has more profound implications. They devised the teleportation protocol [23]; this protocol
consumes two classical bit channels and one ebit to transmit a qubit from a sender
to receiver. Right now, without any technical development yet, it may be unclear how the 
qubit gets from the sender to receiver. The original authors described it as the "disembodied 
transport of a quantum state." Suffice it for now to say that it is the unique properties of 
entanglement (in particular, the ebit) that enable this disembodied transport to occur. Yet 
again, it is entanglement that is the resource that enables this protocol, but let us be careful 
not to overstate the role of entanglement. Entanglement alone cannot do much. These pro- 
tocols show that it is the unique combination of entanglement and quantum communication 
or entanglement and classical communication that yields these results. These two noiseless 
protocols are cornerstones of quantum Shannon theory, originally suggesting that there are 
interesting ways of combining the resources of classical communication, quantum commu- 
nication, and entanglement to formulate uniquely quantum protocols and leading the way 
to more exotic protocols that combine the different noiseless resources with noisy resources. 
Simple questions concerning these protocols lead to quantum Shannon-theoretic protocols. 
In super-dense coding, how much classical information can Alice send if the quantum channel 
becomes noisy? What if the entanglement is noisy? In teleportation, how much quantum 
information can Alice send if the classical channel is noisy? What if the entanglement is 
noisy? Researchers addressed these questions quite a bit after the original super-dense cod- 
ing and teleportation protocols were available, and we address these important questions in 
this book. 

The year 1994 was a landmark for quantum information science. Shor published his algorithm
that factors a number in polynomial time [223]; this algorithm gives an exponential
speedup over the best known classical algorithm. We cannot overstate the importance of
this algorithm for the field. Its major application is to break RSA encryption [208] because
the security of that encryption algorithm relies on the computational difficulty of factoring a 
large number. This breakthrough generated wide interest in the idea of a quantum computer 
and started the quest to build one and study its abilities.

Initially, much skepticism met the idea of building a practical quantum computer [181,
124]. Some experts thought that it would be impossible to overcome errors that inevitably
occur during quantum interactions, due to the coupling of a quantum system with its environment.
Shor met this challenge by devising the first quantum error-correcting code [224]
and a scheme for fault-tolerant quantum computation [225]. His paper on quantum error
correction is the one most relevant for quantum Shannon theory. At the end of this paper,
he posed the idea of the quantum capacity of a noisy quantum channel as the highest rate
at which a sender and receiver can maintain the fidelity of a quantum state through a large
number of uses of the noisy channel. This open problem set the main task for researchers
interested in quantum Shannon theory. A flurry of theoretical activity then ensued in quantum
error correction [54, 235, 180, 112, 113, 52, 53] and fault-tolerant quantum computation
[6, 173, 206, 175]. These two areas are now important subfields within quantum information
science, but we do not focus on them in any detail in this book. 

Schumacher published a critical paper in 1995 as well [216] (we discussed some of his
contributions in the previous section). This paper gave the first informational notion of a 
qubit, and it even established the now ubiquitous term "qubit." He proved the quantum 
analog of Shannon's source coding theorem, giving the ultimate compressibility of quantum 
information. He used the notion of a typical subspace as an analogy of Shannon's typical set. 
This notion of a typical subspace proves to be one of the most crucial ideas for constructing 
codes in quantum Shannon theory, just as the notion of a typical set is so crucial for Shannon's 
information theory. 

Not much later, several researchers began investigating the capacity of a noisy quantum 
channel for sending classical information [126]. Holevo [144] and Schumacher and Westmoreland
[219] independently proved that the Holevo information of a quantum channel is
an achievable rate for classical communication over it. They appealed to Schumacher's no- 
tion of a typical subspace and constructed channel codes for sending classical information. 
The proof looks somewhat similar to the proof of Shannon's channel coding theorem (dis- 
cussed in the next chapter) after taking a few steps away from it. The proof of the converse 
theorem proceeds somewhat analogously to that of Shannon's theorem, with the exception 
that one of the steps uses Holevo's bound from 1973. It is perhaps somewhat surprising 
that it took over twenty years between the appearance of the proof of Holevo's bound (the
main step in the converse proof) and the appearance of a direct coding theorem for sending 
classical information. 

The quantum capacity theorem is perhaps the fundamental theorem of quantum Shannon 
theory. Initial work by several researchers provided some insight into the quantum capacity 
theorem [26, 30, 29, 220], and a series of papers by Barnum, Knill, Nielsen, and Schumacher
established an upper bound on the quantum capacity [217, 218, 16, 15]. For the lower bound,
Lloyd was the first to construct an idea for a proof, but it turns out that his proof was
more of a heuristic proof [185]. Shor then followed with another proof of the lower bound
[227], and some of Shor's ideas appeared much later in a full publication [134]. Devetak
[68] and Cai, Winter, and Yeung [51] independently solved the private capacity theorem
at approximately the same time (with the publication of the CWY paper appearing a year
after Devetak's arXiv post). Devetak took the proof of the private capacity theorem a step 
further and showed how to apply its techniques to construct a quantum code that achieves 
a good lower bound on the quantum capacity, while also providing an alternate, cleaner 
proof of the converse theorem [68]. It is Devetak's technique that we mainly explore in this
book because it provides some insight into the coding structure (though, we also explore a 
different technique via the entanglement-assisted classical capacity theorem). 

The 2000s — In recent years, we have had many advancements in quantum Shannon 
theory (technically some of the above contributions were in the 2000s, but we did not want 
to break the continuity of the history of the quantum capacity theorem). One major result 
was the proof of the entanglement-assisted classical capacity theorem — it is the noisy version 
of the super-dense coding protocol where the quantum channel is noisy [33, 34, 146]. This
theorem assumes that Alice and Bob share unlimited entanglement and they exploit the 
entanglement and the noisy quantum channel to send classical information. 

A few fantastic results have arisen in recent years. Horodecki, Oppenheim, and Winter 
showed the existence of a state-merging protocol [148, 149]. This protocol gives the minimum
rate at which Alice and Bob consume noiseless qubit channels in order for Alice to send 
her part of a quantum state to Bob. This rate is the conditional quantum entropy — the 
protocol thus gives an operational interpretation to this entropic quantity. What was most 
fascinating about this result is that the conditional quantum entropy can be negative in 
quantum Shannon theory. Prior to their work, no one really understood what it meant for the 
conditional quantum entropy to become negative [246, 154, 53], but this state merging result
gave a good operational interpretation. A negative rate implies that Alice and Bob gain the 
ability for future quantum communication, instead of consuming quantum communication 
as when the rate is positive. 

Another fantastic result came from Smith and Yard [234]. Suppose we have two noisy 
quantum channels and each of them individually has zero capacity to transmit quantum 
information. One would expect intuitively that the "joint quantum capacity" (when using 
them together) would also have zero ability to transmit quantum information. But this re- 
sult is not generally the case in the quantum world. It is possible for some particular noisy 
quantum channels with no individual quantum capacity to have a non-zero joint quantum 
capacity. It is not clear yet how we might practically take advantage of such a "superacti- 
vation" effect, but the result is nonetheless fascinating, counterintuitive, and not yet fully 
understood. 

The latter part of this decade has seen the unification of quantum Shannon theory. The 
resource inequality framework was the first step because it unified many previously known 
results into one formalism [71, 70]. Devetak, Harrow, and Winter provided a family tree for
quantum Shannon theory and showed how to relate the different protocols in the tree to one 
another. We will go into the theory of resource inequalities in some detail throughout this 
book because it provides a tremendous conceptual simplification when considering coding 
theorems in quantum Shannon theory. In fact, the last chapter of this book contains a concise 
summary of many of the major quantum Shannon-theoretic protocols in the language of 
resource inequalities. Abeyesinghe, Devetak, Hayden, and Winter published a work showing
a sense in which the mother protocol of the family tree can generate the father protocol
[3]. We have seen unification efforts in the form of triple trade-off coding theorems [4, 159,
160]. These theorems give the optimal combination of classical communication, quantum
communication, entanglement, and an asymptotic noisy resource for achieving a variety of 
quantum information processing tasks. 

We have also witnessed the emergence of a study of network quantum Shannon theory. 
Some authors have tackled the quantum broadcasting paradigm [270, 84, 118, 119], where
one sender transmits to multiple receivers. A multiple-access quantum channel has many
senders and one receiver. Some of the same authors (and others) have tackled multiple-access
communication [257, 269, 266, 272, 268, 156, 63]. This network quantum Shannon theory
should become increasingly important as we get closer to the ultimate goal of a quantum 
Internet. 

Quantum Shannon theory has now established itself as an important and distinct field 
of study. The next few chapters discuss the concepts that will prepare us for tackling some 
of the major results in quantum Shannon theory. 
CHAPTER 2



Classical Shannon Theory 



We cannot overstate the importance of Shannon's contribution to modern science. His 
introduction of the field of information theory and his solutions to its two main theorems 
demonstrate that his ideas on communication were far beyond the other prevailing ideas in 
this domain around 1948. 

In this chapter, our aim is to discuss Shannon's two main contributions in a descriptive 
fashion. The goal of this high-level discussion is to build up the intuition for the problem 
domain of information theory and to understand the main concepts before we delve into 
the analogous quantum information-theoretic ideas. We avoid going into deep technical 
detail in this chapter, leaving such details for later chapters where we formally prove both 
classical and quantum Shannon-theoretic coding theorems. We do use some mathematics 
from probability theory, namely, the law of large numbers. 

We will be delving into the technical details of this chapter's material in later chapters
(specifically, Chapters 10, 12, and 13). Once you have reached later chapters that develop
some more technical details, it might be helpful to turn back to this chapter to get an overall 
flavor for the motivation of the development. 



2.1 Data Compression 



We first discuss the problem of data compression. Those who are familiar with the Internet 
have used several popular data formats such as JPEG, MPEG, ZIP, GIF, etc. All of these file
formats have corresponding algorithms for compressing the output of an information source. 
A first glance at the compression problem may lead one to believe that it is possible to 
compress the output of the information source to an arbitrarily small size, but Shannon 
proved that arbitrarily small compression is not possible. This result is the content of 
Shannon's first noiseless coding theorem. 
2.1.1 An Example of Data Compression

We begin with a simple example that illustrates the concept of an information source. We 
then develop a scheme for coding this source so that it requires fewer bits to represent its 
output faithfully. 

Suppose that Alice is a sender and Bob is a receiver. Suppose further that a noiseless 
bit channel connects Alice to Bob — a noiseless bit channel is one that transmits information 
perfectly from sender to receiver, e.g., Bob receives '0' if Alice transmits '0' and Bob receives 
'1' if Alice transmits '1'. Alice and Bob would like to minimize the number of times that 
they use this noiseless channel because it is expensive to use it. 

Alice would like to use the noiseless channel to communicate information to Bob. Suppose 
that an information source randomly chooses from four symbols {a, b, c, d} and selects them
with a skewed probability distribution: 

Pr{a} = 1/2, (2.1) 

Pr{b} = 1/8, (2.2) 

Pr{c} = 1/4, (2.3) 

Pr{d} = 1/8. (2.4) 

So it is clear that the symbol a is the most likely one, c the next likely, and both b and 
d are least likely. We make the additional assumption that the information source chooses 
each symbol independently of all previous ones and chooses each with the same probability 
distribution above. After the information source makes a selection, it gives the symbol to 
Alice for coding. 

A noiseless bit channel only accepts bits — it does not accept the symbols a, b, c, d as 
input. So, Alice has to encode her information into bits. Alice could use the following coding 
scheme: 

$$a \to 00, \quad b \to 01, \quad c \to 10, \quad d \to 11, \tag{2.5}$$

where each binary representation of a letter is a codeword. How do we measure the per- 
formance of a particular coding scheme? The expected length of a codeword is one way to 
measure performance. For the above example, the expected length is equal to two bits. This 
measure reveals a problem with the above scheme — the scheme does not take advantage of 
the skewed nature of the distribution of the information source because each codeword is the 
same length. 

One might instead consider a scheme that uses shorter codewords for symbols that are
more likely and longer codewords for symbols that are less likely.¹ Then the expected length
of a codeword should be shorter than the expected length of a codeword in the above scheme.
The following coding scheme gives an improvement in the expected length of a codeword:

$$a \to 0, \quad b \to 110, \quad c \to 10, \quad d \to 111. \tag{2.6}$$

¹ Such coding schemes are common. Samuel F. B. Morse employed this idea in his popular Morse code.
Also, in the movie The Diving Bell and the Butterfly, a writer becomes paralyzed with "locked-in" syndrome
so that he can only blink his left eye. An assistant then develops a "blinking code" where she reads a list of
letters in French, beginning with the most commonly used letter and ending with the least commonly used
letter. The writer blinks when she says the letter he wishes and they finish an entire book with this coding
scheme.

The above scheme has the advantage that any coded sequence is uniquely decodable. For 
example, suppose that Bob obtains the following sequence: 

0011010111010100010. (2.7) 

Bob can parse the above sequence as 

$$0 \ \ 0 \ \ 110 \ \ 10 \ \ 111 \ \ 0 \ \ 10 \ \ 10 \ \ 0 \ \ 0 \ \ 10, \tag{2.8}$$

and determine that Alice transmitted the message 

aabcdaccaac. (2.9) 

We can calculate the expected length of this coding scheme as follows: 

$$\tfrac{1}{2}(1) + \tfrac{1}{8}(3) + \tfrac{1}{4}(2) + \tfrac{1}{8}(3) = \tfrac{7}{4}. \tag{2.10}$$

This scheme is thus more efficient because its expected length is 7/4 bits as opposed to two 
bits. It is a variable-length code because the number of bits in each codeword depends on 
the source symbol. 
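
The two coding schemes above can be checked with a short sketch. The function and variable names below are illustrative assumptions, not notation from the text. The decoder relies on the fact that no codeword in (2.6) is a prefix of another, so reading bits left to right and emitting a symbol as soon as a codeword is matched recovers the message.

```python
# Illustrative sketch: all names here are hypothetical and not from the text.
PROBS = {"a": 1/2, "b": 1/8, "c": 1/4, "d": 1/8}               # source (2.1)-(2.4)
FIXED_CODE = {"a": "00", "b": "01", "c": "10", "d": "11"}       # scheme (2.5)
VARIABLE_CODE = {"a": "0", "b": "110", "c": "10", "d": "111"}   # scheme (2.6)


def expected_length(code, probs):
    """Average number of bits per source symbol under the given code."""
    return sum(probs[s] * len(code[s]) for s in probs)


def encode(message, code):
    """Concatenate the codewords of the symbols in the message."""
    return "".join(code[s] for s in message)


def decode(bits, code):
    """Decode a bit string, assuming no codeword is a prefix of another."""
    inverse = {word: symbol for symbol, word in code.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:          # a complete codeword has been read
            symbols.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(symbols)


message = "aabcdaccaac"                 # the message in (2.9)
bits = encode(message, VARIABLE_CODE)
print(bits, decode(bits, VARIABLE_CODE))
print(expected_length(FIXED_CODE, PROBS), expected_length(VARIABLE_CODE, PROBS))
```

For the eleven-symbol message in (2.9), the fixed-length scheme uses 22 bits while the variable-length scheme uses 19.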

2.1.2 A Measure of Information 

The above scheme suggests a way to measure information. Consider the probability distribution
in (2.1)-(2.4). Would we be more surprised to learn that the information source produced
the symbol a or to learn that it produced the symbol d? The answer is d because the source
is less likely to produce it. One measure of the surprise of symbol $x \in \{a, b, c, d\}$ is

$$i(x) \equiv \log\left(\frac{1}{p(x)}\right) = -\log(p(x)), \tag{2.11}$$

where the logarithm is base two — this convention implies the units of this measure are bits. 
This measure of surprise has the desirable property that it is higher for lower probability 
events and lower for higher probability events. Here, we take after Shannon, and we name 
$i(x)$ the information content of the symbol x. Observe that the length of each codeword in
the coding scheme in (2.6) is equal to the information content of its corresponding symbol.

The information content has another desirable property called additivity. Suppose that
the information source produces two symbols $x_1$ and $x_2$. The probability for this event is
$p(x_1, x_2)$ and the joint distribution factors as $p(x_1)p(x_2)$ if we assume the source is memoryless,
that is, if it produces each symbol independently. The information content of the two symbols
$x_1$ and $x_2$ is additive because

\begin{align}
i(x_1, x_2) &= -\log(p(x_1, x_2)) \tag{2.12}\\
&= -\log(p(x_1)p(x_2)) \tag{2.13}\\
&= -\log(p(x_1)) - \log(p(x_2)) \tag{2.14}\\
&= i(x_1) + i(x_2). \tag{2.15}
\end{align}

In general, additivity is a desirable property for any information measure. We will return to
the issue of additivity in many different contexts in this book (especially in Chapter 12).

The expected information content of the information source is

$$\sum_{x} p(x)\, i(x) = -\sum_{x} p(x) \log(p(x)). \tag{2.16}$$

The above quantity is so important in information theory that we give it a name: the entropy 
of the information source. The reason for its importance is that the entropy and variations 
of it appear as the answer to many questions in information theory. For example, in the 
above coding scheme, the expected length of a codeword is the entropy of the information 
source because 

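\begin{align}
-\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{8}\log\tfrac{1}{8} - \tfrac{1}{4}\log\tfrac{1}{4} - \tfrac{1}{8}\log\tfrac{1}{8}
&= \tfrac{1}{2}(1) + \tfrac{1}{8}(3) + \tfrac{1}{4}(2) + \tfrac{1}{8}(3) \tag{2.17}\\
&= \tfrac{7}{4}. \tag{2.18}
\end{align}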



It is no coincidence that we chose the particular coding scheme in (2.6). The effectiveness of 
the scheme in this example is related to the structure of the information source — the number 
of symbols is a power of two and the probability of each symbol is the reciprocal of a power 
of two. 
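
The quantities in this section can be computed directly for the example source. This is a minimal sketch whose function names are assumptions made for illustration. It checks that each codeword length in (2.6) equals the information content of its symbol, that information content is additive for independent symbols, and that the entropy equals the expected codeword length of 7/4 bits.

```python
import math

# Illustrative sketch: all names here are hypothetical and not from the text.
PROBS = {"a": 1/2, "b": 1/8, "c": 1/4, "d": 1/8}               # source (2.1)-(2.4)
VARIABLE_CODE = {"a": "0", "b": "110", "c": "10", "d": "111"}   # scheme (2.6)


def information_content(p):
    """Surprise -log2(p), in bits, of an event with probability p, as in (2.11)."""
    return -math.log2(p)


def entropy(probs):
    """Expected information content of the source, as in (2.16)."""
    return sum(p * information_content(p) for p in probs.values() if p > 0)


for symbol, p in PROBS.items():
    # Codeword length next to the information content of the same symbol.
    print(symbol, len(VARIABLE_CODE[symbol]), information_content(p))

# Additivity for two independent symbols, here a followed by b.
joint = PROBS["a"] * PROBS["b"]
print(information_content(joint),
      information_content(PROBS["a"]) + information_content(PROBS["b"]))

print(entropy(PROBS))
```

The exact agreement between codeword lengths and information content depends on every probability being a reciprocal of a power of two, which is the special structure noted above.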

2.1.3 Shannon's Source Coding Theorem 

The next question to ask is whether there is any other scheme that can achieve a better
compression rate than the scheme in (2.6). This question is the one that Shannon asked in
his first coding theorem. To answer this question, we consider a more general information 
source and introduce a notion of Shannon, the idea of the set of typical sequences. 

We can represent a more general information source with a random variable X whose
realizations x are letters in an alphabet $\mathcal{X}$. Let $p_X(x)$ be the probability mass function
associated with random variable X, so that the probability of realization x is $p_X(x)$. Let
H(X) denote the entropy of the information source:

$$H(X) \equiv -\sum_{x \in \mathcal{X}} p_X(x) \log(p_X(x)). \tag{2.19}$$
The entropy H(X) is also the entropy of the random variable X. Another way of writing it
is H(p), but we use the more common notation H(X) throughout this book.
The information content i(X) of random variable X is

$$i(X) \equiv -\log(p_X(X)), \tag{2.20}$$

and is itself a random variable. There is nothing wrong mathematically here with having
random variable X as the argument to the probability mass function $p_X$, though this expression may
seem self-referential at a first glance. This way of thinking turns out to be useful later.
Again, the expected information content of X is equal to the entropy:

$$\mathbb{E}_X\{-\log(p_X(X))\} = H(X). \tag{2.21}$$

Exercise 2.1.1 Show that the entropy of a uniform random variable is equal to $\log|\mathcal{X}|$ where
$|\mathcal{X}|$ is the size of the variable's alphabet.

We now turn to source coding the above information source. We could associate a binary
codeword for each symbol x as we did in the scheme in (2.6). But this scheme may lose some
efficiency if the size of our alphabet is not a power of two or if the probabilities are not a
reciprocal of a power of two as they are in our nice example. Shannon's breakthrough idea
was to let the source emit a large number of realizations and then code the emitted data as
a large block, instead of coding each symbol as the above example does. This technique is
called block coding. Shannon's other insight was to allow for a slight error in the compression
scheme, but show that this error vanishes as the block size becomes arbitrarily large. To make
the block coding scheme clearer, Shannon suggests letting the source emit the following
sequence:

$$x^n \equiv x_1 x_2 \cdots x_n, \tag{2.22}$$

where n is a large number that denotes the size of the block of emitted data and $x_i$, for all
$i = 1, \ldots, n$, denotes the $i$th emitted symbol. Let $X^n$ denote the random variable associated
with the sequence $x^n$, and let $X_i$ be the random variable for the $i$th symbol $x_i$. Figure 2.1
depicts Shannon's idea for a classical source code.
depicts Shannon's idea for a classical source code. 

The most important assumption for this information source is that it is independent and
identically distributed (IID). The IID assumption means that each random variable $X_i$ has
the same distribution as random variable X, and we use the index i merely to track to which
symbol $x_i$ the random variable $X_i$ corresponds. Under the IID assumption, the probability
of any given emitted sequence $x^n$ factors as

\begin{align}
p_{X^n}(x^n) &= p_{X_1, X_2, \ldots, X_n}(x_1, x_2, \ldots, x_n) \tag{2.23}\\
&= p_{X_1}(x_1)\, p_{X_2}(x_2) \cdots p_{X_n}(x_n) \tag{2.24}\\
&= p_X(x_1)\, p_X(x_2) \cdots p_X(x_n) \tag{2.25}\\
&= \prod_{i=1}^{n} p_X(x_i). \tag{2.26}
\end{align}
Figure 2.1: The above figure depicts Shannon's idea for a classical source code. The information source
emits a long sequence $x^n$ to Alice. She encodes this sequence as a block with an encoder $\mathcal{E}$ and produces a
codeword whose length is less than that of the original sequence $x^n$ (indicated by fewer lines coming out of
the encoder $\mathcal{E}$). She transmits the codeword over noiseless bit channels (each indicated by "I" which stands
for the identity bit channel) and Bob receives it. Bob decodes the transmitted codeword with a decoder $\mathcal{D}$
and produces the original sequence that Alice transmitted, only if their chosen code is good, in the sense
that the code has a small probability of error.



The above rule from probability theory results in a remarkable simplification of the mathematics.
Suppose that we now label the letters in the alphabet $\mathcal{X}$ as $a_1, \ldots, a_{|\mathcal{X}|}$ in order to
distinguish the letters from the realizations. Let $N(a_i|x^n)$ denote the number of occurrences
of the letter $a_i$ in the sequence $x^n$ (where $i = 1, \ldots, |\mathcal{X}|$). As an example, consider the
sequence in (2.9). The quantities $N(a_i|x^n)$ for this example are

\begin{align}
N(a|x^n) &= 5, \tag{2.27}\\
N(b|x^n) &= 1, \tag{2.28}\\
N(c|x^n) &= 4, \tag{2.29}\\
N(d|x^n) &= 1. \tag{2.30}
\end{align}

We can rewrite the result in (2.26) as

$$p_{X^n}(x^n) = \prod_{i=1}^{n} p_X(x_i) = \prod_{i=1}^{|\mathcal{X}|} p_X(a_i)^{N(a_i|x^n)}. \tag{2.31}$$

Keep in mind that we are allowing the length n of the emitted sequence to be extremely
large so that it is much larger than the alphabet size $|\mathcal{X}|$:

$$n \gg |\mathcal{X}|. \tag{2.32}$$

The formula on the right in (2.31) is much simpler than the formula in (2.26) because it
has fewer iterations of multiplications. There is a sense in which the IID assumption allows
us to permute the sequence $x^n$ as

$$x^n \to \underbrace{a_1 \cdots a_1}_{N(a_1|x^n)} \ \underbrace{a_2 \cdots a_2}_{N(a_2|x^n)} \ \cdots \ \underbrace{a_{|\mathcal{X}|} \cdots a_{|\mathcal{X}|}}_{N(a_{|\mathcal{X}|}|x^n)} \tag{2.33}$$
because the probability calculation is invariant under this permutation. We introduce the
above way of thinking right now because it turns out to be useful later when we develop
some ideas in quantum Shannon theory (specifically in Section 13.9). Thus, the formula on
the right in (2.31) characterizes the probability of any given sequence $x^n$.
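
The equality in (2.31) and the permutation invariance can be checked numerically on the sequence in (2.9). The following is a small sketch with hypothetical function names: one function multiplies the probabilities position by position as in (2.26), and the other groups the factors by letter using the counts $N(a_i|x^n)$.

```python
import math
from collections import Counter

# Illustrative sketch: all names here are hypothetical and not from the text.
PROBS = {"a": 1/2, "b": 1/8, "c": 1/4, "d": 1/8}   # source (2.1)-(2.4)


def prob_by_position(sequence, probs):
    """Product over the n positions of the sequence, as in (2.26)."""
    return math.prod(probs[x] for x in sequence)


def prob_by_counts(sequence, probs):
    """Product over the alphabet, using occurrence counts, as in (2.31)."""
    counts = Counter(sequence)          # counts[a] plays the role of N(a | x^n)
    return math.prod(probs[a] ** counts[a] for a in probs)


sequence = "aabcdaccaac"                # the sequence in (2.9)
print(Counter(sequence))
print(prob_by_position(sequence, PROBS), prob_by_counts(sequence, PROBS))

# Sorting the letters is one permutation of the sequence; the probability is unchanged.
print(prob_by_position("".join(sorted(sequence)), PROBS))
```

The second form needs only one factor per letter of the alphabet, however long the sequence is.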

The above discussion applies to a particular sequence $x^n$ that the information source
emits. Now, we would like to analyze the behavior of a random sequence $X^n$ that the
source emits, and this distinction between the realization $x^n$ and the random variable $X^n$
is important. In particular, let us consider the sample average of the information content
of the random sequence $X^n$ (divide the information content of $X^n$ by n to get the sample
average):

$$-\frac{1}{n}\log(p_{X^n}(X^n)). \tag{2.34}$$

It may seem strange at first glance that $X^n$, the argument of the probability mass function
$p_{X^n}$, is itself a random variable, but this type of expression is perfectly well defined mathematically.
(This self-referencing type of expression is similar to (2.20), which we used to
calculate the entropy.) For reasons that will become clear shortly, we call the above quantity
the sample entropy of the random sequence $X^n$.

Suppose now that we use the function $N(a_i|\bullet)$ to calculate the number of appearances
of the letter $a_i$ in the random sequence $X^n$. We write the desired quantity as $N(a_i|X^n)$ and
note that it is also a random variable, whose random nature derives from that of $X^n$. We
can reduce the expression in (2.34) to the following one with some algebra and the result
in (2.31):

\begin{align}
-\frac{1}{n}\log(p_{X^n}(X^n)) &= -\frac{1}{n}\log\left(\prod_{i=1}^{|\mathcal{X}|} p_X(a_i)^{N(a_i|X^n)}\right) \tag{2.35}\\
&= -\frac{1}{n}\sum_{i=1}^{|\mathcal{X}|} \log\left(p_X(a_i)^{N(a_i|X^n)}\right) \tag{2.36}\\
&= -\sum_{i=1}^{|\mathcal{X}|} \frac{N(a_i|X^n)}{n} \log(p_X(a_i)). \tag{2.37}
\end{align}

We stress again that the above quantity is random. 

Is there any way that we can determine the behavior of the above sample entropy when n
becomes large? Probability theory gives us a way. The expression $N(a_i|X^n)/n$ represents an
empirical distribution for the letters $a_i$ in the alphabet $\mathcal{X}$. As n becomes large, one form of
the law of large numbers states that it is overwhelmingly likely that a random sequence has
its empirical distribution $N(a_i|X^n)/n$ close to the true distribution $p_X(a_i)$, and conversely,
it is highly unlikely that a random sequence does not satisfy this property. Thus, a random
emitted sequence $X^n$ is highly likely to satisfy the following condition for all $\delta > 0$ as n
becomes large:

$$\lim_{n \to \infty} \Pr\left\{ \left| -\frac{1}{n}\log(p_{X^n}(X^n)) - \sum_{i=1}^{|\mathcal{X}|} p_X(a_i) \log\left(\frac{1}{p_X(a_i)}\right) \right| \leq \delta \right\} = 1. \tag{2.38}$$

The quantity $-\sum_{i=1}^{|\mathcal{X}|} p_X(a_i) \log(p_X(a_i))$ is none other than the entropy H(X) so that the
above expression is equivalent to the following one for all $\delta > 0$:

$$\lim_{n \to \infty} \Pr\left\{ \left| -\frac{1}{n}\log(p_{X^n}(X^n)) - H(X) \right| \leq \delta \right\} = 1. \tag{2.39}$$

Another way of stating this property is as follows: 

It is highly likely that the information source emits a sequence whose sample 
entropy is close to the true entropy, and conversely, it is highly unlikely that the 
information source emits a sequence that does not satisfy this property.²
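
A quick simulation illustrates the concentration of the sample entropy around the true entropy. This is only a sketch under stated assumptions: it uses the four-symbol example source, Python's pseudorandom generator with an arbitrarily chosen seed, and hypothetical function names. It shows typical behavior for a few block lengths and is not a proof of (2.39).

```python
import math
import random

# Illustrative sketch: all names here are hypothetical and not from the text.
PROBS = {"a": 1/2, "b": 1/8, "c": 1/4, "d": 1/8}   # source (2.1)-(2.4)


def sample_entropy(sequence, probs):
    """Compute -(1/n) log2 p(x^n) for an IID source, as in (2.34)."""
    return -sum(math.log2(probs[x]) for x in sequence) / len(sequence)


def draw_sequence(n, probs, rng):
    """Draw n IID symbols from the source distribution."""
    symbols = list(probs)
    weights = [probs[s] for s in symbols]
    return rng.choices(symbols, weights=weights, k=n)


true_entropy = -sum(p * math.log2(p) for p in PROBS.values())
rng = random.Random(0)                  # fixed seed, chosen arbitrarily
for n in (10, 100, 1000, 10000):
    h = sample_entropy(draw_sequence(n, PROBS, rng), PROBS)
    # Block length, sample entropy, and its distance from H(X).
    print(n, round(h, 4), round(abs(h - true_entropy), 4))
```

The gap between the sample entropy and H(X) tends to shrink as n grows, although any single run can fluctuate.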

Now we consider a particular realization $x^n$ of the random sequence $X^n$. We name a
particular sequence $x^n$ a typical sequence if its sample entropy is close to the true entropy
H(X) and the set of all typical sequences is the typical set. Fortunately for data compression,
the set of typical sequences is not too large. In Chapter 13 on typical sequences, we prove
that the size of this set is much smaller than the set of all sequences. We accept it for now
(and prove later) that the size of the typical set is $\approx 2^{nH(X)}$, whereas the size of the set of
all sequences is equal to $|\mathcal{X}|^n$. We can rewrite the size of the set of all sequences as

$$|\mathcal{X}|^n = 2^{n \log|\mathcal{X}|}. \tag{2.40}$$

Comparing the size of the typical set to the size of the set of all sequences, the typical set
is exponentially smaller than the set of all sequences whenever the random variable is not
equal to the uniform random variable. Figure 2.2 illustrates this concept. We summarize
these two crucial properties of the typical set and give another that we prove later:

Property 2.1.1 (Unit Probability) The probability that an emitted sequence is typical 
approaches unity as n becomes large. Another way of stating this property is that the typical 
set has almost all of the probability. 

Property 2.1.2 (Exponentially Small Cardinality) The size of the typical set is $\approx 2^{nH(X)}$
and is exponentially smaller than the size $2^{n \log|\mathcal{X}|}$ of the set of all sequences whenever random
variable X is not uniform.



² Do not fall into the trap of thinking "The possible sequences that the source emits are typical sequences."
That line of reasoning is quantitatively far from the truth. In fact, what we can show is much different because
the set of typical sequences is much smaller than the set of all possible sequences.



©2012 Mark M. Wilde — This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 
Unported License 






Figure 2.2: The typical set is much smaller (exponentially smaller) than the set of all sequences.
In the figure, the set of all sequences is labeled $n \log|\mathcal{X}|$ and the typical set inside it is
labeled $nH(X)$. The typical set is the same size as the set of all sequences only when the entropy
$H(X)$ of the random variable $X$ is equal to $\log|\mathcal{X}|$, implying that the distribution of random
variable $X$ is uniform.



Property 2.1.3 (Equipartition) The probability of a particular typical sequence is roughly
uniform, $\approx 2^{-nH(X)}$. (The probability $2^{-nH(X)}$ is easy to calculate if we accept that the
typical set has all of the probability, its size is $2^{nH(X)}$, and the distribution over typical
sequences is uniform.)

These three properties together are collectively known as the asymptotic equipartition 
theorem. The word "asymptotic" applies because the theorem exploits the asymptotic limit 
when n is large and the word "equipartition" refers to the third property above. 
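
To make these three properties concrete, the following Python sketch enumerates the typical set of a small binary source by brute force. The source distribution (0.9, 0.1), the block length n = 16, the tolerance delta = 0.15, and the function names are illustrative assumptions rather than anything fixed by the text, and the block length is far too small for the asymptotic statements to hold tightly.

```python
import itertools
import math


def entropy(p):
    """Shannon entropy (in bits) of a distribution given as a list of probabilities."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def typical_set(p, n, delta):
    """Brute-force the delta-typical set for n IID draws from p.

    A sequence counts as typical when its sample entropy is within
    delta of H(X). Returns the list of typical sequences and their
    total probability.
    """
    h = entropy(p)
    sequences, total_prob = [], 0.0
    for x in itertools.product(range(len(p)), repeat=n):
        log_prob = sum(math.log2(p[s]) for s in x)
        sample_entropy = -log_prob / n
        if abs(sample_entropy - h) <= delta:
            sequences.append(x)
            total_prob += 2.0 ** log_prob
    return sequences, total_prob


# Hypothetical parameters chosen only to keep the enumeration small.
p, n, delta = [0.9, 0.1], 16, 0.15
h = entropy(p)
seqs, prob = typical_set(p, n, delta)

print("H(X)                 =", round(h, 4))
print("typical set size     =", len(seqs))
print("2^(nH(X))            =", round(2 ** (n * h), 1))
print("all sequences        =", 2 ** n)
print("typical probability  =", round(prob, 4))
```

At this block length the typical set is already a small fraction of all sequences, but its total probability is still well short of unity; the unit probability property only emerges as n grows.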

With the above notions of a typical set under our belt, a strategy for compressing information
should now be clear. The strategy is to compress only the typical sequences that the
source emits. We simply need to establish an invertible encoding function that maps from
the set of typical sequences (size $2^{nH(X)}$) to the set of all binary strings of length $nH(X)$
(this set also has size $2^{nH(X)}$). If the source emits an atypical sequence, we declare an error.
This coding scheme is reliable in the asymptotic limit because the probability of an error
event vanishes as $n$ becomes large, thanks to the unit probability property in the asymptotic
equipartition theorem. We measure the rate of this block coding scheme as follows:



    $\text{compression rate} = \frac{\text{\# of noiseless channel bits}}{\text{\# of source symbols}}.$    (2.41)



For the case of Shannon compression, the number of noiseless channel bits is equal to nH(X) 
and the number of source symbols is equal to n. Thus, the rate is H(X) and this protocol 
gives an operational interpretation to the Shannon entropy H(X) because it appears as the 
rate of data compression. 
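
The compression strategy above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the source distribution, block length, tolerance, and the name `build_codebook` are hypothetical, the typical set is found by brute force, and at finite n the number of bits is the ceiling of the logarithm of the typical set size rather than exactly nH(X).

```python
import itertools
import math


def build_codebook(p, n, delta):
    """Index the delta-typical sequences of an IID source with fixed-length bit strings."""
    h = -sum(q * math.log2(q) for q in p if q > 0)  # entropy H(X) in bits
    typical = [
        x
        for x in itertools.product(range(len(p)), repeat=n)
        if abs(-sum(math.log2(p[s]) for s in x) / n - h) <= delta
    ]
    # Enough bits to give every typical sequence its own label.
    num_bits = max(1, math.ceil(math.log2(len(typical))))
    encode = {x: format(i, "0{}b".format(num_bits)) for i, x in enumerate(typical)}
    decode = {bits: x for x, bits in encode.items()}
    return encode, decode, num_bits


# Hypothetical parameters, small enough for exhaustive enumeration.
p, n, delta = [0.9, 0.1], 16, 0.15
encode, decode, num_bits = build_codebook(p, n, delta)

typical_x = (0,) * 14 + (1, 1)   # two ones out of sixteen: typical for this source
atypical_x = (0,) * 16           # all zeros: atypical, so the scheme declares an error

bits = encode[typical_x]
print("bits per block   =", num_bits)
print("compression rate =", num_bits / n)
print("round trip ok    =", decode[bits] == typical_x)
print("atypical input   =", encode.get(atypical_x, "error"))
```

The encoder is invertible on the typical set by construction, and every sequence outside it is treated as an error event, exactly as in the scheme described above.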

One may then wonder whether this rate of data compression is the best that we can 
do — whether this rate is optimal (we could achieve a lower rate of compression if it were not 
optimal). In fact, the above rate is the optimal rate at which we can compress information. 



We hold off on a formal proof of optimality for now and delay it until we reach Chapter 17.

The above discussion highlights the common approach in information theory for establishing
a coding theorem. Proving a coding theorem has two parts, traditionally called the
direct coding theorem and the converse theorem. First, we give a coding scheme that can
achieve a given rate for an information processing task. This first part includes a direct
construction of a coding scheme, hence the name direct coding theorem. The formal statement
of the direct coding theorem for the above task is

"If the rate of compression is greater than the entropy of the source, then there 
exists a coding scheme that can achieve lossless data compression in the sense that 
it is possible to make the probability of error for incorrectly decoding arbitrarily 
small." 



The second part is to prove that the rate from the direct coding theorem is optimal, that is,
that we cannot do any better than the suggested rate. We traditionally call this part the converse
theorem because it formally corresponds to the converse of the above statement:

"If there exists a coding scheme that can achieve lossless data compression with 
arbitrarily small probability of decoding error, then the rate of compression is 
greater than the entropy of the source." 

The techniques used in proving each part of the coding theorem are completely different. 
For most coding theorems in information theory, we can prove the direct coding theorem by 
appealing to the ideas of typical sequences and large block sizes. That this technique gives 
a good coding scheme is directly related to the asymptotic equipartition theorem properties 
that govern the behavior of random sequences of data as the length of the sequence becomes 
large. The proof of a converse theorem relies on information inequalities that give tight
bounds on the entropic quantities appearing in the coding constructions. We spend some
time with information inequalities in Chapter 10 to build up our ability to prove converse
theorems.

Sometimes, in the course of proving a direct coding theorem, one may believe that one has
found the optimal rate for a given information processing task. Without a matching converse
theorem, it is not generally clear that the suggested rate is optimal. So, always prove converse
theorems!



2.2 Channel Capacity 



The next issue that we overview is the transmission of information over a noisy classical 
channel. We begin with a standard example — transmitting a single bit of information over 
a noisy bit-flip channel.

Figure 2.3: The action of the bit-flip channel. It preserves the input bit with
probability $1 - p$ and flips it with probability $p$.



2.2.1 An Example of an Error Correction Code 

We again have our protagonists, Alice and Bob, as respective sender and receiver. This 
time, though, we assume that a noisy classical channel connects them, so that information 
transfer is not reliable. Alice and Bob realize that a noisy channel is not as expensive as 
a noiseless one, but it still is expensive for them to use. For this reason, they would like 
to maximize the amount of information that Alice can communicate reliably to Bob, where 
reliable communication implies that there is zero probability of error for every transmission. 
The simplest example of a noisy classical channel is a bit-flip channel, with the technical
name binary symmetric channel. This channel flips the input bit with probability $p$ and
leaves it unchanged with probability $1 - p$. Figure 2.3 depicts the action of the bit-flip
channel. The channel behaves independently from one use to the next and behaves in the
same random way as described above. For this reason, this channel is an independent and
identically distributed (IID) channel. This assumption will again be important when we go
to the asymptotic regime of a large number of uses of the channel.

Suppose that Alice and Bob just use the channel as is — Alice just sends plain bits to 
Bob. This scheme works reliably only if the probability of bit flip error vanishes. So, Alice 
and Bob could invest their best efforts into engineering the physical channel to make it 
reliable. But, generally, it is not possible to engineer a classical channel this way for physical 
or logistical reasons. For example, Alice and Bob may only have local computers at their 
ends and may not have access to the physical channel because the telephone company may 
control the channel. 

Alice and Bob can employ a "systems engineering" solution to this problem rather than 
an engineering of the physical channel. They can redundantly encode information in a way 
such that Bob can have a higher probability of determining what Alice is sending, effectively 
reducing the level of noise on the channel. A simple example of this systems engineering 
solution is the three-bit majority vote code. Alice and Bob employ the following encoding: 

    $0 \to 000, \qquad 1 \to 111,$    (2.42)

where both '000' and '111' are codewords. Alice transmits the codeword '000' with three
independent uses of the noisy channel if she really wants to communicate a '0' to Bob,
and she transmits the codeword '111' if she wants to send a '1' to him. The physical or
channel bits are the actual bits that she transmits over the noisy channel, and the logical
or information bits are those that she intends for Bob to receive. In our example, '0' is a
logical bit and '000' corresponds to the physical bits.

    Channel Output      Probability
    000                 $(1-p)^3$
    001, 010, 100       $p(1-p)^2$
    011, 110, 101       $p^2(1-p)$
    111                 $p^3$

Table 2.1: The first column gives the eight possible outputs of the noisy bit-flip channel when Alice encodes
a '0' with the majority vote code. The second column gives the corresponding probability of Bob receiving
the particular outputs.

The rate of this scheme is 1/3 because it encodes one information bit into three physical
bits. The term "rate" is perhaps a misnomer for coding scenarios that do not involve sending
bits in a time sequence over a channel. We may just as well use the majority vote code to
store one bit in a memory device that may be unreliable. Perhaps a more universal term is
efficiency. Nevertheless, we follow convention and use the term rate throughout this book.

Of course, the noisy bit-flip channel does not always transmit these codewords without 
error. So how does Bob decode in the case of error? He simply takes a majority vote to 
determine the transmitted message — he decodes as '0' if the number of zeros in the codeword 
he receives is greater than the number of ones. 



We now analyze the performance of this simple "systems engineering" solution. Table 2.1
enumerates the probability of receiving every possible sequence of three bits, assuming that
Alice transmits a '0' by encoding it as '000'. The probability of no error is $(1-p)^3$, the
probability of a single-bit error is $3p(1-p)^2$, the probability of a double-bit error is $3p^2(1-p)$,
and the probability of a total failure is $p^3$. The majority vote solution can "correct" for no
error and it corrects for all single-bit errors, but it has no ability to correct for double-bit
and triple-bit errors. In fact, it actually incorrectly decodes these latter two scenarios by
"correcting" '011', '110', or '101' to '111' and decoding '111' as a '1'. Thus, these latter two
outcomes are errors because the code has no ability to correct them. We can employ similar
arguments as above to the case where Alice transmits a '1' to Bob with the majority vote
code. When does this majority vote scheme perform better than no coding at all? It is
exactly when the probability of error with the majority vote code is less than $p$, the probability
of error with no coding. The probability of error is equal to the following quantity:

    $p(e) = p(e|0)\,p(0) + p(e|1)\,p(1).$    (2.43)

Our analysis above suggests that the conditional probabilities $p(e|0)$ and $p(e|1)$ are equal for
the majority vote code because of the symmetry in the noisy bit-flip channel. This result
implies that the probability of error is

    $p(e) = 3p^2(1-p) + p^3$    (2.44)
    $\phantom{p(e)} = 3p^2 - 2p^3,$    (2.45)

because $p(0) + p(1) = 1$. We consider the following inequality to determine if the majority
vote code reduces the probability of error:

    $3p^2 - 2p^3 < p.$    (2.46)

This inequality simplifies as

    $0 < 2p^3 - 3p^2 + p$    (2.47)
    $\therefore\ 0 < p(2p-1)(p-1).$    (2.48)

The only values of $p$ that satisfy the above inequality are $0 < p < 1/2$. Thus, the majority
vote code reduces the probability of error only when $0 < p < 1/2$, i.e., when the noise on
the channel is not too much. Too much noise has the effect of causing the codewords to flip
too often, throwing off Bob's decoder.
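
The error analysis above can be checked numerically. The following Python sketch enumerates all eight channel outputs for the transmitted codeword '000', sums the probabilities of those that the majority vote decodes incorrectly, and compares the result with the closed form $3p^2 - 2p^3$. The function names and the sample values of p are illustrative assumptions.

```python
import itertools


def majority_decode(bits):
    """Decode to 1 if more than half of the received bits are ones, else to 0."""
    return 1 if sum(bits) > len(bits) / 2 else 0


def majority_error_probability(p, n=3):
    """Exact probability of a decoding error when '0' is sent as n zeros
    over a bit-flip channel with flip probability p."""
    error = 0.0
    for received in itertools.product((0, 1), repeat=n):
        flips = sum(received)  # the codeword is all zeros, so every one is a flip
        prob = p ** flips * (1 - p) ** (n - flips)
        if majority_decode(received) != 0:
            error += prob
    return error


for p in (0.1, 0.3, 0.5, 0.7):
    coded = majority_error_probability(p)
    closed_form = 3 * p ** 2 - 2 * p ** 3
    verdict = "helps" if coded < p else "does not help"
    print(p, round(coded, 6), round(closed_form, 6), verdict)
```

The enumeration agrees with (2.45), and the coded error probability drops below p only for flip probabilities strictly between 0 and 1/2.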

The majority vote code gives a way for Alice and Bob to reduce the probability of error 
during their communication, but unfortunately, there is still a non-zero probability for the 
noisy channel to disrupt their communication. Is there any way that they can achieve reliable 
communication by reducing the probability of error to zero? 

One simple approach to achieve this goal is to exploit the majority vote idea a second 
time. They can concatenate two instances of the majority vote code to produce a code with 
a larger number of physical bits. Concatenation consists of using one code as an "inner" 
code and another as an "outer" code. There is no real need for us to distinguish between the 
inner and outer code in this case because we use the same code for both the inner and outer 
code. The concatenation scheme for our case first encodes the message $i$, where $i \in \{0, 1\}$,
using the majority vote code. Let us label the codewords as follows:

    $\bar{0} = 000, \qquad \bar{1} = 111.$    (2.49)

For the second layer of the concatenation, we encode $\bar{0}$ and $\bar{1}$ with the majority vote code
again:

    $\bar{0} \to \bar{0}\bar{0}\bar{0}, \qquad \bar{1} \to \bar{1}\bar{1}\bar{1}.$    (2.50)

Thus, the overall encoding of the concatenated scheme is as follows:

    $0 \to 000\ 000\ 000, \qquad 1 \to 111\ 111\ 111.$    (2.51)

The rate of the concatenated code is 1/9 and smaller than the original rate of 1/3. A simple
application of the above performance analysis for the majority vote code shows that this
concatenation scheme reduces the probability of error as follows:

    $3[p(e)]^2 - 2[p(e)]^3 = O(p^4).$    (2.52)

The error probability $p(e)$ is in (2.45) and $O(p^4)$ indicates that the leading order term of the
left-hand side is the fourth power in $p$.

The concatenated scheme achieves a lower probability of error at the cost of using more
physical bits in the code. Recall that our goal is to achieve reliable communication, where
there is no probability of error. A first guess for achieving reliable communication is to
continue concatenating. If we concatenate again, the probability of error reduces to $O(p^8)$,
and the rate drops to 1/27. We can continue indefinitely with concatenating to make the
probability of error arbitrarily small and achieve reliable communication, but the problem is
that the rate approaches zero as the probability of error becomes arbitrarily small.
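
The trade-off between rate and error probability under repeated concatenation can be tabulated with a short Python sketch. It assumes that Bob decodes layer by layer, so that each decoded inner block behaves like one use of a bit-flip channel whose flip probability is the error probability of the layer below; the function names and the value p = 0.1 are illustrative assumptions.

```python
def concatenated_error(p, levels):
    """Error probability after `levels` layers of the three-bit majority vote code,
    assuming layer-by-layer (hierarchical) majority decoding."""
    q = p
    for _ in range(levels):
        q = 3 * q ** 2 - 2 * q ** 3  # one more application of the map in (2.45)
    return q


def concatenated_rate(levels):
    """Rate of the code after `levels` layers: one logical bit per 3**levels physical bits."""
    return 1 / 3 ** levels


p = 0.1  # hypothetical flip probability of the physical channel
for levels in range(5):
    print(levels, concatenated_rate(levels), concatenated_error(p, levels))
```

Each extra layer divides the rate by three while driving the error probability down rapidly, which is the behavior described above: the error can be made arbitrarily small, but only at a rate that tends to zero.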

The above example seems to show that there is a trade-off between the rate of the 
encoding scheme and the desired order of error probability. Is there a way that we can code 
information for a noisy channel while maintaining a good rate of communication? 

2.2.2 Shannon's Channel Coding Theorem 

Shannon's second breakthrough coding theorem provides an affirmative answer to the above 
question. This answer came as a complete shock to communication researchers in 1948. 
Furthermore, the techniques that Shannon used in demonstrating this fact were rarely used 
by engineers at the time. We give a broad overview of Shannon's main idea and techniques 
that he used to prove his second important theorem — the noisy channel coding theorem. 

2.2.3 General Model for a Channel Code 

We first generalize some of the ideas in the above example. We still have Alice trying 
to communicate with Bob, but this time, she wants to be able to transmit a larger set 
of messages with asymptotically perfect reliability, rather than merely sending '0' or '1'. 
Suppose that she selects messages from a message set [M] that consists of M messages: 

    $[M] = \{1, \ldots, M\}.$    (2.53)

Suppose furthermore that Alice chooses a particular message m with uniform probability 
from the set [M]. This assumption of a uniform distribution for Alice's messages indicates 
that we do not really care much about the content of the actual message that she is trans- 
mitting. We just assume total ignorance of her message because we only really care about 
her ability to send any message reliably. The message set [M] requires log(M) bits to rep- 
resent it, where the logarithm is again base two. This number becomes important when we 
calculate the rate of a channel code. 

The next aspect of the model that we need to generalize is the noisy channel that con- 
nects Alice to Bob. We used the bit-flip channel before, but this channel is not general 
enough for our purposes. A simple way to extend the channel model is to represent it as 
a conditional probability distribution involving an input random variable X and an output 
random variable Y: 

    $\mathcal{N}: \quad p_{Y|X}(y|x).$    (2.54)

Figure 2.4: Shannon's idea for a classical channel code. Alice chooses a message
$m$ with uniform probability from a message set $[M] = \{1, \ldots, M\}$. She encodes the message $m$ with an
encoding operation $\mathcal{E}$. This encoding operation assigns a codeword $x^n$ to the message $m$ and inputs the
codeword $x^n$ to a large number of IID uses of a noisy channel $\mathcal{N}$. The noisy channel randomly corrupts the
codeword $x^n$ to a sequence $y^n$. Bob receives the corrupted sequence $y^n$ and performs a decoding operation $\mathcal{D}$
to estimate the codeword $x^n$. This estimate of the codeword $x^n$ then produces an estimate $\hat{m}$ of the message
that Alice transmitted. A reliable code has the property that Bob can decode each message $m \in [M]$ with
a vanishing probability of error when the block length $n$ becomes large.



We use the symbol $\mathcal{N}$ to represent this more general channel model. One assumption that
we make about random variables $X$ and $Y$ is that they are discrete, but the respective sizes
of their outcome sets do not have to match. The other assumption that we make concerning
the noisy channel is that it is IID. Let $X^n = X_1 X_2 \cdots X_n$ and $Y^n = Y_1 Y_2 \cdots Y_n$ be the
random variables associated with respective sequences $x^n = x_1 x_2 \cdots x_n$ and $y^n = y_1 y_2 \cdots y_n$.
If Alice inputs the sequence $x^n$ to the $n$ inputs of $n$ respective uses of the noisy channel, a
possible output sequence may be $y^n$. The IID assumption allows us to factor the conditional
probability of the output sequence $y^n$:

    $p_{Y^n|X^n}(y^n|x^n) = p_{Y_1|X_1}(y_1|x_1)\, p_{Y_2|X_2}(y_2|x_2) \cdots p_{Y_n|X_n}(y_n|x_n)$    (2.55)
    $\phantom{p_{Y^n|X^n}(y^n|x^n)} = p_{Y|X}(y_1|x_1)\, p_{Y|X}(y_2|x_2) \cdots p_{Y|X}(y_n|x_n)$    (2.56)
    $\phantom{p_{Y^n|X^n}(y^n|x^n)} = \prod_{i=1}^{n} p_{Y|X}(y_i|x_i).$    (2.57)


The technical name of this more general channel model is a discrete memoryless channel. 
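
The factorization in (2.55)-(2.57) is straightforward to compute. In the Python sketch below, a channel is stored as a table whose entry `channel[x][y]` holds $p_{Y|X}(y|x)$; this representation, the function name, and the two example channels (a bit-flip channel and an erasure-style channel with three outputs) are assumptions made for illustration. The second example shows that the input and output alphabets need not have the same size.

```python
import itertools
import math


def sequence_likelihood(channel, x_seq, y_seq):
    """p(y^n | x^n) for n IID uses of a discrete memoryless channel.

    channel[x][y] is the single-use conditional probability p(y | x).
    """
    if len(x_seq) != len(y_seq):
        raise ValueError("input and output sequences must have the same length")
    return math.prod(channel[x][y] for x, y in zip(x_seq, y_seq))


# Bit-flip channel with a hypothetical flip probability of 0.1.
bit_flip = [[0.9, 0.1],
            [0.1, 0.9]]

# Hypothetical erasure-style channel: two inputs, three outputs (index 2 is "erased").
erasure = [[0.8, 0.0, 0.2],
           [0.0, 0.8, 0.2]]

print(sequence_likelihood(bit_flip, (0, 0, 0), (0, 1, 0)))

# For a fixed input sequence, the likelihoods of all output sequences sum to one.
total = sum(
    sequence_likelihood(erasure, (0, 1), y)
    for y in itertools.product(range(3), repeat=2)
)
print(total)
```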

A coding scheme or code translates all of Alice's messages into codewords that can be 
input to n IID uses of the noisy channel. For example, suppose that Alice selects a message 
m to encode. We can write the codeword corresponding to message m as x n (m) because the 
input to the channel is some codeword that depends on m. 

The last part of the model involves Bob receiving the corrupted codeword $y^n$ over the
channel and determining a potential codeword $x^n$ with which it should be associated. We
do not get into any details just yet for this last decoding part; imagine for now that it
operates similarly to the majority vote code example. Figure 2.4 displays Shannon's model
of communication that we have described.

We calculate the rate of a given coding scheme as follows:

    $\text{rate} = \frac{\text{\# of message bits}}{\text{\# of channel uses}}.$    (2.58)

In our model, the rate of a given coding scheme is

    $R = \frac{1}{n} \log(M),$    (2.59)

where $\log(M)$ is the number of bits needed to represent any message in the message set $[M]$
and $n$ is the number of channel uses. The capacity of a noisy channel is the highest rate at
which it can communicate information reliably.

We also need a way to determine the performance of any given code. Here, we list several
measures of performance. Let $\mathcal{C} = \{x^n(m)\}_{m \in [M]}$ represent a code that Alice and Bob choose,
where $x^n(m)$ denotes each codeword corresponding to the message $m$. Let $p_e(m, \mathcal{C})$ denote
the probability of error when Alice transmits a message $m \in [M]$ using the code $\mathcal{C}$. We
denote the average probability of error as

    $\bar{p}_e(\mathcal{C}) = \frac{1}{M} \sum_{m=1}^{M} p_e(m, \mathcal{C}).$    (2.60)

The maximal probability of error is

    $p_e^*(\mathcal{C}) = \max_{m \in [M]} p_e(m, \mathcal{C}).$    (2.61)

Our ultimate aim is to make the maximal probability of error $p_e^*(\mathcal{C})$ arbitrarily small, but
the average probability of error $\bar{p}_e(\mathcal{C})$ is important in the analysis. These two performance
measures are related: the average probability of error is of course small if the maximal
probability of error is. Perhaps surprisingly, the probability of error is small for at least
half of the messages if the average probability of error is. We make this statement more
quantitative in the following exercise.

Exercise 2.2.1 Use Markov's inequality to prove that the following upper bound on the
average probability of error

    $\frac{1}{M} \sum_{m} p_e(m, \mathcal{C}) < \epsilon$    (2.62)

implies the following upper bound for at least half of the messages $m$:

    $p_e(m, \mathcal{C}) < 2\epsilon.$    (2.63)

You may have wondered why we use the random sequence $X^n$ to model the inputs to the
channel. We have already stated that Alice's message is a uniform random variable, and the
codewords in any coding scheme directly depend on the message to be sent. For example, in
the majority vote code, the channel inputs are always '000' whenever the intended message
is '0' and similarly for the channel inputs '111' and the message '1'. So why is there a need
to overcomplicate things by modeling the channel inputs as the random variable $X^n$ when
it seems like each codeword is a deterministic function of the intended message? We are not
yet ready to answer this question but will return to it shortly.

We should also stress an important point before proceeding with Shannon's ingenious 
scheme for proving the existence of reliable codes for a noisy channel. In the above model, 
we described essentially two "layers of randomness" : 

1. The first layer of randomness is the uniform random variable associated with Alice's 
choice of a message. 

2. The second layer of randomness is the noisy channel. The output of the channel is 
a random variable because we cannot always predict the output of the channel with 
certainty. 

It is not possible to "play around" with these two layers of randomness. The random 
variable associated with Alice's message is fixed as a uniform random variable because we 
assume ignorance of Alice's message. The conditional probability distribution of the noisy 
channel is also fixed. We are assuming that Alice and Bob can learn the conditional prob- 
ability distribution associated with the noisy channel by estimating it. Alternatively, we 
may assume that a third party has knowledge of the conditional probability distribution and 
informs Alice and Bob of it in some way. Regardless of how they obtain the knowledge of 
the distribution, we assume that they both know it and that it is fixed. 

2.2.4 Description of the Proof of Shannon's Channel Coding Theorem

We are now ready to present an overview of Shannon's technique for proving the existence 
of a code that can achieve the capacity of a given noisy channel. Some of the methods 
that Shannon uses in his outline of a proof are similar to those in the first coding theorem. 
We again use the channel a large number of times so that the Law of Large Numbers from 
probability theory comes into play and allow for a small probability of error that vanishes as 
the number of channel uses becomes large. If the notion of typical sequences is so important 
in the first coding theorem, we might suspect that it should be important in the noisy channel 
coding theorem as well. The typical set captures a certain notion of efficiency because it is 
a small set when compared to the set of all sequences, but it is the set that has almost all 
of the probability. Thus, we should expect this efficiency to come into play somehow in the 
channel coding theorem. 

The aspect of Shannon's technique for proving the noisy channel coding theorem that is
different from the other ideas in the first theorem is the idea of random coding. Shannon's
technique adds a third layer of randomness to the model given above (recall that the first
two are Alice's random message and the random nature of the noisy channel).

The third layer of randomness is to choose the codewords themselves in a random fashion
according to a random variable $X$, where we choose each letter $x_i$ of a given codeword $x^n$
independently according to the distribution $p_X(x_i)$. It is for this reason that we model the
channel inputs as a random variable. We can then write each codeword as a random variable
$X^n(m)$. The probability distribution for choosing a particular codeword $x^n(m)$ is

    $\Pr\{X^n(m) = x^n(m)\} = p_{X_1, X_2, \ldots, X_n}(x_1(m), x_2(m), \ldots, x_n(m))$    (2.64)
    $\phantom{\Pr\{X^n(m) = x^n(m)\}} = p_X(x_1(m))\, p_X(x_2(m)) \cdots p_X(x_n(m))$    (2.65)
    $\phantom{\Pr\{X^n(m) = x^n(m)\}} = \prod_{i=1}^{n} p_X(x_i(m)).$    (2.66)

The important result to notice is that the probability for a given codeword factors because
we choose the code in an IID fashion, and perhaps more importantly, the distribution of each
codeword has no explicit dependence on the message $m$ with which it is associated. That
is, the probability distribution of the first codeword is exactly the same as the probability
distribution of all of the other codewords. The code $\mathcal{C}$ itself becomes a random variable in
this scheme for choosing a code randomly. We now let $\mathcal{C}$ refer to the random variable that
represents a random code, and we let $\mathcal{C}_0$ represent any particular deterministic code. The
probability of choosing a particular code $\mathcal{C}_0 = \{x^n(m)\}_{m \in [M]}$ is

    $p_{\mathcal{C}}(\mathcal{C}_0) = \prod_{m=1}^{M} \prod_{i=1}^{n} p_X(x_i(m)),$    (2.67)

and this probability distribution again has no explicit dependence on each message $m$ in the
code $\mathcal{C}_0$.

Choosing the codewords in a random way allows for a dramatic simplification in the 
mathematical analysis of the probability of error. Shannon's breakthrough idea was to 
analyze the expectation of the average probability of error, where the expectation is with 
respect to the random code C, rather than analyzing the average probability of error itself. 
The expectation of the average probability of error is 

    $\mathbb{E}_{\mathcal{C}}\{\bar{p}_e(\mathcal{C})\}.$    (2.68)

This expectation is much simpler to analyze because of the random way that we choose the 
code. Consider that 



    $\mathbb{E}_{\mathcal{C}}\{\bar{p}_e(\mathcal{C})\} = \mathbb{E}_{\mathcal{C}}\left\{ \frac{1}{M} \sum_{m=1}^{M} p_e(m, \mathcal{C}) \right\}.$    (2.69)



Using linearity of the expectation, we can exchange the expectation with the sum so that

    $\mathbb{E}_{\mathcal{C}}\{\bar{p}_e(\mathcal{C})\} = \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{\mathcal{C}}\{p_e(m, \mathcal{C})\}.$    (2.70)

Now, the expectation of the probability of error for a particular message $m$ does not actually
depend on the message $m$ because the distribution of each random codeword $X^n(m)$ does not
explicitly depend on $m$. This line of reasoning leads to the dramatic simplification because
$\mathbb{E}_{\mathcal{C}}\{p_e(m, \mathcal{C})\}$ is then the same for all messages. So we can then say that

    $\mathbb{E}_{\mathcal{C}}\{p_e(m, \mathcal{C})\} = \mathbb{E}_{\mathcal{C}}\{p_e(1, \mathcal{C})\}.$    (2.71)

(We could have equivalently chosen any message instead of the first.) We then have that

    $\mathbb{E}_{\mathcal{C}}\{\bar{p}_e(\mathcal{C})\} = \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{\mathcal{C}}\{p_e(1, \mathcal{C})\}$    (2.72)
    $\phantom{\mathbb{E}_{\mathcal{C}}\{\bar{p}_e(\mathcal{C})\}} = \mathbb{E}_{\mathcal{C}}\{p_e(1, \mathcal{C})\},$    (2.73)

where the last step follows because the quantity $\mathbb{E}_{\mathcal{C}}\{p_e(1, \mathcal{C})\}$ has no dependence on $m$. We
now only have to determine the expectation of the probability of error for one message
instead of determining the expectation of the average error probability of the whole set.
This simplification follows because random coding results in the equivalence of these two
quantities.

Shannon then determined a way to obtain a bound on the expectation of the average
probability of error (we soon discuss this technique briefly) so that

    $\mathbb{E}_{\mathcal{C}}\{\bar{p}_e(\mathcal{C})\} < \epsilon,$    (2.74)

where $\epsilon$ is some number that we can make arbitrarily small by letting the block size $n$
become arbitrarily large. If it is possible to obtain a bound on the expectation of the
average probability of error, then surely there exists some deterministic code $\mathcal{C}_0$ whose average
probability of error meets this same bound:

    $\bar{p}_e(\mathcal{C}_0) < \epsilon.$    (2.75)

If it were not so, then the original bound on the expectation would not be possible. This step 
is the derandomization step of Shannon's proof. Ultimately, we require a deterministic code 
with a high rate and arbitrarily small probability of error and this step shows the existence 
of such a code. The random coding technique is only useful for simplifying the mathematics 
of the proof. 
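
A small numerical experiment illustrates why derandomization works: among any collection of codes, the best one has an average error probability no larger than the mean over the collection. The Python sketch below draws random codes for a bit-flip channel and computes each code's average error probability exactly by enumerating all channel outputs. Everything specific here is an assumption made for illustration: the parameters, uniformly random codeword letters, minimum Hamming distance decoding with ties going to the lowest message index (a stand-in for the decoder that the text has not yet specified), and the use of a sample mean over 50 codes as an estimate of the expectation.

```python
import itertools
import random


def hamming(a, b):
    """Number of positions in which two equal-length sequences differ."""
    return sum(u != v for u, v in zip(a, b))


def average_error(code, p):
    """Exact average error probability of `code` over a bit-flip channel with
    flip probability p, under minimum Hamming distance decoding."""
    n = len(code[0])
    total = 0.0
    for m, codeword in enumerate(code):
        for y in itertools.product((0, 1), repeat=n):
            d = hamming(codeword, y)
            prob = p ** d * (1 - p) ** (n - d)
            # Closest codeword wins; min() breaks ties in favor of the lowest index.
            guess = min(range(len(code)), key=lambda k: hamming(code[k], y))
            if guess != m:
                total += prob
    return total / len(code)


def random_code(num_messages, n, rng):
    """Draw each letter of each codeword independently and uniformly from {0, 1}."""
    return [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(num_messages)]


rng = random.Random(0)               # fixed seed so the experiment is repeatable
p, n, num_messages = 0.1, 7, 4       # hypothetical channel and code parameters
errors = [average_error(random_code(num_messages, n, rng), p) for _ in range(50)]

sample_mean = sum(errors) / len(errors)
best = min(errors)
print("sample mean of average error =", sample_mean)
print("best sampled code            =", best)
```

Whatever the sampled codes turn out to be, the best of them can never do worse than their mean, which is the same observation that guarantees the existence of a good deterministic code.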

Figure 2.5: The notion of a conditionally typical set. Associated to every input
sequence $x^n$ is a conditionally typical set consisting of the likely output sequences. The size of this
conditionally typical set is $\approx 2^{nH(Y|X)}$. It is exponentially smaller than the set of all output sequences
whenever the conditional random variable is not uniform.

The last step of the proof is the expurgation step. It is an application of the result of
Exercise 2.2.1. Recall that our goal is to show the existence of a high rate code that has low
maximal probability of error. But so far we only have a bound on the average probability
of error. In the expurgation step, we simply throw out the half of the codewords with the
worst probability of error. Throwing out the worse half of the codewords reduces the number
of messages by a factor of two, but only has a negligible impact on the rate of the code.
Consider that the number of messages is $2^{nR}$ where $R$ is the rate of the code. Thus, the
number of messages is $2^{n(R - \frac{1}{n})}$ after throwing out the worse half of the codewords, and the
rate $R - \frac{1}{n}$ is asymptotically equivalent to the rate $R$. After throwing out the worse half of
the codewords, the result of Exercise 2.2.1 shows that the following bound then applies to
the maximal probability of error:

    $p_e^*(\mathcal{C}_0) < 2\epsilon.$    (2.76)

This last expurgation step ends the analysis of the probability of error.
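
The expurgation step can be illustrated on a toy example. In the Python sketch below, the per-message error probabilities and the block length are made-up numbers and the function name `expurgate` is hypothetical; the point is only to show that discarding the worse half of the messages leaves a maximal error probability of at most twice the original average, at a cost of 1/n in rate.

```python
import math


def expurgate(error_probs):
    """Return the indices of the better half of the messages, ranked by error probability."""
    ranked = sorted(range(len(error_probs)), key=lambda m: error_probs[m])
    return sorted(ranked[: len(ranked) // 2])


# Hypothetical per-message error probabilities p_e(m, C) for M = 8 messages.
error_probs = [0.01, 0.02, 0.30, 0.01, 0.03, 0.02, 0.25, 0.04]
n = 6  # hypothetical block length

average = sum(error_probs) / len(error_probs)
kept = expurgate(error_probs)
max_after = max(error_probs[m] for m in kept)

rate_before = math.log2(len(error_probs)) / n
rate_after = math.log2(len(kept)) / n

print("average error before     =", average)
print("maximal error before     =", max(error_probs))
print("messages kept            =", kept)
print("maximal error after      =", max_after)
print("rate before, rate after  =", rate_before, rate_after)
```

Before expurgation the maximal error probability is far above twice the average; afterwards it is well below it, and the rate has dropped by exactly 1/n.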

We now discuss the size of the code that Alice and Bob employ. Recall that the rate of
the code is $R = \log(M)/n$. It is convenient to define the size $M$ of the message set $[M]$ in
terms of the rate $R$. When we do so, the size of the message set is

    $M = 2^{nR}.$    (2.77)

What is peculiar about the message set size when defined this way is that it grows exponentially
with the number of channel uses. But recall that any given code exploits $n$ channel
uses to send $M$ messages. So when we take the limit as the number of channel uses tends
to infinity, we are implying that there exists a sequence of codes whose message set size is
$M = 2^{nR}$ and number of channel uses is $n$. We are focused on keeping the rate of the code
constant and use the limit of $n$ to make the probability of error vanish for a certain fixed
rate $R$.

Figure 2.6: The above figure depicts the packing argument that Shannon used. The channel induces a
conditionally typical set corresponding to each codeword $x^n(i)$ where $i \in \{1, \ldots, M\}$. The size of each
conditionally typical output set is $2^{nH(Y|X)}$. The size of the typical set of all output sequences is $2^{nH(Y)}$.
These sizes suggest that we can divide the output typical set into $M$ conditionally typical sets and be able
to distinguish $M \approx 2^{nH(Y)} / 2^{nH(Y|X)}$ messages without error.

What is the maximal rate at which Alice can communicate to Bob reliably? We need
to determine the number of distinguishable messages that Alice can reliably send to Bob,
and we require the notion of conditional typicality to do so. Consider that Alice chooses
codewords randomly according to a random variable $X$ with probability distribution $p_X(x)$.
By the asymptotic equipartition theorem, it is highly likely that each of the codewords that
Alice chooses is a typical sequence with sample entropy close to $H(X)$. In the coding scheme,
Alice transmits a particular codeword $x^n$ over the noisy channel and Bob receives a random
sequence $Y^n$. The random sequence $Y^n$ is a random variable that depends on $x^n$ through the
conditional probability distribution $p_{Y|X}(y|x)$. We would like a way to determine the number
of possible output sequences that are likely to correspond to a particular input sequence $x^n$.
A useful entropic quantity for this situation is the conditional entropy $H(Y|X)$, the technical
details of which we leave for Chapter 10. For now, just think of this conditional entropy as
measuring the uncertainty of a random variable $Y$ when one already knows the value of the
random variable $X$. The conditional entropy $H(Y|X)$ is always less than the entropy $H(Y)$
unless $X$ and $Y$ are independent. This inequality holds because knowledge of a correlated
random variable $X$ does not increase the uncertainty about $Y$. It turns out that there is a
notion of conditional typicality, similar to the notion of typicality, and a similar asymptotic
equipartition theorem holds for conditionally typical sequences (more details in Section 13.9).
This theorem also has three important properties. For each input sequence $x^n$, there is a
corresponding conditionally typical set with the following properties:

1. It has almost all of the probability: it is highly likely that a random channel output
sequence is conditionally typical given a particular input sequence.

2. Its size is $\approx 2^{nH(Y|X)}$.

3. The probability of each conditionally typical sequence $y^n$, given knowledge of the input
sequence $x^n$, is $\approx 2^{-nH(Y|X)}$.
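As a rough numerical check of the second property, the following sketch counts conditionally typical outputs for one concrete channel. The binary symmetric channel, the block length, and the helper names `binary_entropy` and `typical_set_exponent` are assumptions made for illustration; they do not come from the text. For this channel, the outputs that are typical given $x^n$ differ from $x^n$ in about $n\varepsilon$ positions, so a binomial coefficient counts them and its exponent should approach $H(Y|X) = h(\varepsilon)$.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) random variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def typical_set_exponent(n: int, eps: float) -> float:
    """(1/n) log2 of the number of outputs with exactly round(n*eps) flips.

    Assumes a binary symmetric channel (an illustrative choice): each output
    bit differs from the input bit independently with probability eps.
    """
    flips = round(n * eps)
    return math.log2(math.comb(n, flips)) / n

n, eps = 1000, 0.1
exponent = typical_set_exponent(n, eps)   # approaches h(eps) as n grows
target = binary_entropy(eps)              # H(Y|X) for this channel
```

The exponent stays slightly below the target at finite $n$ because the binomial coefficient carries a polynomial correction factor that vanishes only in the limit.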



If we disregard knowledge of the input sequence used to generate an output sequence,
the probability distribution that generates the output sequences is

$$p_Y(y) = \sum_x p_{Y|X}(y|x) \, p_X(x). \tag{2.78}$$

We can think that this probability distribution is the one that generates all the possible
output sequences. The likely output sequences are in an output typical set of size $2^{nH(Y)}$.

We are now in a position to describe the structure of a random code and the size of the
message set. Alice generates $2^{nR}$ codewords according to the distribution $p_X(x)$, and suppose
for now that Bob has knowledge of the code after Alice generates it. Suppose Alice sends
one of the codewords over the channel. Bob is ignorant of the transmitted codeword, so from
his point of view, the output sequences are generated according to the distribution $p_Y(y)$.
Bob then employs typical sequence decoding. He first determines if the output sequence $y^n$
is in the typical output set of size $2^{nH(Y)}$. If not, he declares an error. The probability of this
type of error is small by the asymptotic equipartition theorem. If the output sequence $y^n$
is in the output typical set, he uses his knowledge of the code to determine the most likely
conditionally typical set of size $2^{nH(Y|X)}$ to which the output sequence belongs. If he decodes
an output sequence $y^n$ to the wrong conditionally typical set, then an error occurs. This
last type of error suggests how they might structure the code in order to prevent this type
of error from happening. If they structure the code so that the output conditionally typical
sets do not overlap too much, then Bob should be able to decode each output sequence $y^n$ to
a unique input sequence $x^n$ with high probability. This line of reasoning suggests that they
should divide the set of output typical sequences into $M$ conditionally typical output
sets, each of size $2^{nH(Y|X)}$. Thus, if they set the number of messages $M = 2^{nR}$ as follows:

$$2^{nR} \approx \frac{2^{nH(Y)}}{2^{nH(Y|X)}} = 2^{n(H(Y) - H(Y|X))}, \tag{2.79}$$

then our intuition is that Bob should be able to decode correctly with high probability. Such
an argument is a "packing" argument because it shows how to pack information efficiently
into the space of all output sequences. Figure 2.6 gives a visual depiction of the packing
argument. It turns out that this intuition is correct: Alice can reliably send information to
Bob if the quantity $H(Y) - H(Y|X)$ bounds the rate $R$:

$$R < H(Y) - H(Y|X). \tag{2.80}$$

A rate less than $H(Y) - H(Y|X)$ ensures that we can make the expectation of the average
probability of error as small as we would like. We then employ the derandomization and
expurgation steps, discussed before, in order to show that there exists a code whose maximal
probability of error vanishes as the number $n$ of channel uses tends to infinity.
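To make the packing count concrete, here is a minimal sketch that evaluates $H(Y) - H(Y|X)$ for one example channel. The binary symmetric channel with flip probability `eps`, the input bias `q`, and the function names are assumptions chosen for illustration only.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) random variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def packing_rate(q: float, eps: float) -> float:
    """H(Y) - H(Y|X) for a binary symmetric channel (illustrative model).

    q   -- probability that the input X equals 1
    eps -- probability that the channel flips the input bit
    """
    p_y1 = q * (1 - eps) + (1 - q) * eps   # Pr{Y = 1}, marginalizing over X
    h_y = binary_entropy(p_y1)             # entropy of the channel output
    h_y_given_x = binary_entropy(eps)      # identical for either input value
    return h_y - h_y_given_x

rate = packing_rate(q=0.5, eps=0.1)
n = 100
log2_messages = n * rate   # log2 of M, with M approx 2^{nH(Y)} / 2^{nH(Y|X)}
```

For these example values the rate is roughly 0.53 bits per channel use, so about $2^{53}$ conditionally typical sets fit into the output typical set when $n = 100$.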

The entropic quantity $H(Y) - H(Y|X)$ deserves special attention because it is another
important entropic quantity in information theory. It is the mutual information between
random variables $X$ and $Y$ and we denote it as

$$I(X;Y) = H(Y) - H(Y|X). \tag{2.81}$$

It is important because it arises as the limiting rate of reliable communication. We will
discuss its properties in more detail throughout this book.

There is one final step that we can take to strengthen the above coding scheme. We
mentioned before that there are three layers of randomness in the coding construction: Alice's
uniform choice of a message, the noisy channel, and Shannon's random coding scheme. The
first two layers of randomness we do not have control over. But we actually do have control
over the last layer of randomness. Alice chooses the code according to the distribution $p_X(x)$.
She can choose the code according to any distribution that she would like. If she chooses it
according to $p_X(x)$, the resulting rate of the code is the mutual information $I(X;Y)$. We will
prove later on that the mutual information $I(X;Y)$ is a concave function of the distribution
$p_X(x)$ when the conditional distribution $p_{Y|X}(y|x)$ is fixed. Concavity implies that any local
maximum of the mutual information is also a global maximum, so there is an optimal
distribution $p_X^*(x)$ that maximizes the mutual information. Thus, Alice should
choose the optimum distribution $p_X^*(x)$ when she randomly generates the code, and this
choice gives the largest possible rate of communication that they could have. This largest
possible rate is the capacity of the channel and we denote it as

$$C(\mathcal{N}) = \max_{p_X(x)} I(X;Y). \tag{2.82}$$
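The maximization in (2.82) can be sketched numerically for a channel with two inputs. The grid search below is only an illustrative stand-in for a proper optimization routine, and the channel matrix `bsc`, the step count, and both function names are assumptions, not part of the text.

```python
import math

def mutual_information(p_x, channel):
    """I(X;Y) = H(Y) - H(Y|X) in bits.

    p_x     -- list of input probabilities p_X(x)
    channel -- channel[x][y] = p_{Y|X}(y|x)
    """
    n_in, n_out = len(p_x), len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(n_in)) for y in range(n_out)]
    h_y = -sum(p * math.log2(p) for p in p_y if p > 0)
    h_y_given_x = -sum(
        p_x[x] * channel[x][y] * math.log2(channel[x][y])
        for x in range(n_in) for y in range(n_out)
        if p_x[x] > 0 and channel[x][y] > 0
    )
    return h_y - h_y_given_x

def capacity_binary_input(channel, steps=1000):
    """Grid search over input distributions (1 - q, q) of a two-input channel."""
    best_q, best_rate = 0.0, 0.0
    for k in range(steps + 1):
        q = k / steps
        rate = mutual_information([1 - q, q], channel)
        if rate > best_rate:
            best_q, best_rate = q, rate
    return best_q, best_rate

# Binary symmetric channel with flip probability 0.1 (example values).
bsc = [[0.9, 0.1], [0.1, 0.9]]
q_star, cap = capacity_binary_input(bsc)
```

For this symmetric example the search settles on the uniform input distribution. Because the mutual information is concave in $p_X(x)$, a coarse grid followed by local refinement would also be a reasonable design for asymmetric channels.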

Our discussion here is just an overview of Shannon's channel capacity theorem. In
Section 13.10, we give a full proof of this theorem after having developed some technical tools
needed for a formal proof.

We clarify one more point. In the discussion of the operation of the code, we mentioned 
that Alice and Bob both have knowledge of the code. Well, how can Bob know the code if a 
noisy channel connects Alice to Bob? The solution to this problem is to assume that Alice 
and Bob have unbounded computation on their local ends. Thus, for a given code that uses 
the channel n times, they can both compute the above optimization problem and generate 
"test" codes randomly until they determine the best possible code to employ for n channel 
uses. They then both end up with the unique, best possible code for n uses of the given 
channel. This scheme might be impractical, but nevertheless, it provides a justification for 
both of them to have knowledge of the code that they use. 

We have said before that the capacity $C(\mathcal{N})$ is the maximal rate at which Alice and Bob
can communicate. But in our discussion above, we did not prove optimality — we only proved 
a direct coding theorem for the channel capacity theorem. It took quite some time and effort 
to develop this elaborate coding procedure — along the way, we repeatedly invoked one of the 
"elephant guns" from probability theory, the law of large numbers. It perhaps seems intuitive 
that typical sequence coding and decoding should lead to optimal code constructions. Typical 
sequences exhibit some kind of asymptotic efficiency by being the most likely to occur, but in 
the general case, their cardinality is exponentially smaller than the set of all sequences. But is 
this intuition about typical sequence coding correct? Is it possible that some other scheme for 
coding might beat this elaborate scheme that Shannon devised? Without a converse theorem 
that proves optimality, we would never know! If you recall from our previous discussion in
Section 2.1.3 about coding theorems, we stressed how important it is to prove a converse
theorem that matches the rate that the direct coding theorem suggests is optimal. For now,
we delay the proof of the converse theorem because the tools for proving it are much different
from the tools we described in this section. Until then, accept that the formula in (2.82) is
indeed the optimal rate at which two parties can communicate; we will prove this result
in a later chapter.

We end the description of Shannon's channel coding theorem by giving the formal
statements of the direct coding theorem and the converse theorem. The formal statement of the
direct coding theorem is as follows:

If the rate of communication is less than the channel capacity, then it is possible
for Alice to communicate reliably to Bob, in the sense that a sequence of codes
exists whose maximal probability of error vanishes as the number of channel uses
tends to infinity.

The formal statement of the converse theorem is as follows: 

If a reliable code exists, then the rate of this code is less than the channel capacity. 

Another way of stating the converse proves to be useful later on: 

If the rate of a coding scheme is greater than the channel capacity, then a reliable 
code does not exist, in the sense that the error probability of the coding scheme 
is bounded away from zero. 



2.3 Summary 



A general communication scenario involves one sender and one receiver. In the classical 
setting, we discussed two information processing tasks that they can perform. The first task 
was data compression or source coding, and we assumed that the sender and receiver share 
a noiseless classical bit channel that they use a large number of times. We can think of this 
noiseless classical bit channel as a noiseless dynamic resource that the two parties share. The 
resource is dynamic because we assume that there is some physical medium through which 
the physical carrier of information travels in order to get from the sender to the receiver. 
It was our aim to count the number of times they would have to use the noiseless resource 
in order to send information reliably. The result of Shannon's source coding theorem is 
that the entropy gives the minimum rate at which they have to use the noiseless resource. 
The second task we discussed was channel coding and we assumed that the sender and 
receiver share a noisy classical channel that they can use a large number of times. This 
noisy classical channel is a noisy dynamic resource that they share. We can think of this 
information processing task as a simulation task, where the goal is to simulate a noiseless 
dynamic resource by using a noisy dynamic resource in a redundant way. This redundancy 
is what allows Alice to communicate reliably to Bob, and reliable communication implies 
that they have effectively simulated a noiseless resource. We again had a resource count for 
this case, where we counted n as the number of times they use the noisy resource and nC is 
the number of noiseless bit channels they simulate (where C is the capacity of the channel). 
This notion of resource counting may not seem so important for the classical case, but it 
becomes much more important for the quantum case. 

We now conclude our overview of Shannon's information theory. The main points to take 
home from this overview are the ideas that Shannon employed for constructing source and 
channel codes. We let the information source emit a large sequence of data, or similarly, 
we use the channel a large number of times so that we can invoke the law of large numbers 
from probability theory. The result is that we can show vanishing error for both schemes
by taking an asymptotic limit. In Chapter 13, we develop the theory of typical sequences in
detail, proving many of the results taken for granted in this overview.

In hindsight, Shannon's methods for proving the two coding theorems are merely a tour
de force for one idea from probability theory: the law of large numbers. Perhaps this
viewpoint undermines the contribution of Shannon, until we recall that no one had even come
close to devising these methods for data compression and channel coding. The theoretical
development of Shannon is one of the most important contributions to modern science
because his theorems determine the ultimate rate at which we can compress and communicate
information.



©2012 Mark M. Wilde — This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 
Unported License 






Part II 



The Quantum Theory 






CHAPTER 3 



The Noiseless Quantum Theory 



The simplest quantum system is the physical quantum bit or qubit. The qubit is a two-level 
quantum system — example qubit systems are the spin of an electron, the polarization of a 
photon, or a two-level atom with a ground state and an excited state. We do not worry too 
much about physical implementations in this chapter, but instead focus on the mathematical 
postulates of the quantum theory and operations that we can perform on qubits. 

We progress from qubits to a study of physical qudits. Qudits are quantum systems 
that have d levels and are an important generalization of qubits. Again, we do not discuss 
physical realizations of qudits. 

Noise can affect quantum systems, and we must understand methods of modeling noise in 
the quantum theory because our ultimate aim is to construct schemes for protecting quantum 
systems against the detrimental effects of noise. In an earlier chapter, we remarked on the different
types of noise that occur in nature. The first, and perhaps more easily comprehensible type 
of noise, is that which is due to our lack of information about a given scenario. We observe 
this type of noise in a casino, with every shuffle of cards or toss of dice. These events are 
random, and the random variables of probability theory model them because the outcomes 
are unpredictable. This noise is the same as that in all classical information processing 
systems. We can engineer physical systems to improve their robustness to noise. 

On the other hand, the quantum theory features a fundamentally different type of noise. 
Quantum noise is inherent in nature and is not due to our lack of information, but is due 
rather to nature itself. An example of this type of noise is the "Heisenberg noise" that 
results from the uncertainty principle. If we know the momentum of a given particle from 
performing a precise measurement of it, then we know absolutely nothing about its position— 
a measurement of its position gives a random result. Similarly, if we know the rectilinear 
polarization of a photon by precisely measuring it, then a future measurement of its diagonal 
polarization will give a random result. It is important to keep the distinction clear between 
these two types of noise. 

We explore the postulates of the quantum theory in this chapter, by paying particular 
attention to qubits. These postulates apply to a closed quantum system that is isolated from 
everything else in the universe. We label this first chapter "Noiseless Quantum Theory"
because closed quantum systems do not interact with their surroundings and are thus not
subject to corruption and information loss. Interaction with surrounding systems can lead to 
loss of information in the sense of the classical noise that we described above. Closed quantum 
systems do undergo a certain type of quantum noise, such as that from the uncertainty 
principle and the act of measurement, because they are subject to the postulates of the 
quantum theory. The name "Noiseless Quantum Theory" thus indicates the closed, ideal 
nature of the quantum systems discussed in this chapter. 

This chapter introduces the four postulates of the quantum theory. The mathematical 
tools of the quantum theory rely on the fundamentals of linear algebra: vectors and
matrices of complex numbers. It may seem strange at first that we need to incorporate the
machinery of linear algebra in order to describe a physical system in the quantum theory, 
but it turns out that this description uses the simplest set of mathematical tools to predict 
the phenomena that a quantum system exhibits. The hallmark of the quantum theory is 
that certain operations do not commute with one another, and matrices are the simplest 
mathematical objects that capture this idea of noncommutativity. 



3.1 Overview 



We first briefly overview how information is processed with quantum systems. This usually 
consists of three steps: state preparation, quantum operations, and measurement. State 
preparation is where we initialize a quantum system to some beginning state, depending on 
what operation we would like a quantum system to execute. There could be some classical 
control device that initializes the state of the quantum system. Observe that the input system 
for this step is a classical system, and the output system is quantum. After initializing the 
state of the quantum system, we perform some quantum operations that evolve its state. This 
stage is where we can take advantage of quantum effects for enhanced information processing 
abilities. Both the input and output systems of this step are quantum. Finally, we need some 
way of reading out the result of the computation, and we can do so with a measurement.
The input system for this step is quantum, and the output is classical. Figure 3.1 depicts
all of these steps. In a quantum communication protocol, spatially separated parties may
execute different parts of these steps, and we are interested in keeping track of the nonlocal
resources needed to implement a communication protocol. Section 3.2 describes quantum
states (and thus state preparation), Section 3.3 describes the noiseless evolution of quantum
states, and Section 3.4 describes "read out" or measurement. For now, we assume that we
can perform all of these steps perfectly and later chapters discuss how to incorporate the
effects of noise.



3.2 Quantum Bits 



The simplest quantum system is a two-state system: a physical qubit. Let $|0\rangle$ denote one
possible state of the system. The left vertical bar and the right angle bracket indicate that we
are using the Dirac notation to represent this state. The Dirac notation has some advantages
for performing calculations in the quantum theory, and we highlight some of these advantages
as we progress through our development. Let $|1\rangle$ denote another possible state of the qubit.
We can encode a classical bit or cbit into a qubit with the following mapping:

$$0 \to |0\rangle, \qquad 1 \to |1\rangle. \tag{3.1}$$

Figure 3.1: All of the steps in a typical noiseless quantum information processing protocol. A classical
control (depicted by the thick black line on the left) initializes the state of a quantum system. The quantum
system then evolves according to some unitary operation (described in Section 3.3). The final step is a
measurement that reads out some classical data $m$ from the quantum system.



So far, nothing in our description above distinguishes a classical bit from a qubit, except
for the funny vertical bar and angle bracket that we place around the bit values. The quantum
theory predicts that the above states are not the only possible states of a qubit. Arbitrary
superpositions (linear combinations) of the above two states are possible as well because the
quantum theory is a linear theory. Suffice it to say that the linearity of the quantum theory
results from the linearity of Schrödinger's equation that governs the evolution of quantum
systems.¹ A general noiseless qubit can be in the following state:

$$\alpha|0\rangle + \beta|1\rangle, \tag{3.2}$$

where the coefficients $\alpha$ and $\beta$ are arbitrary complex numbers with unit norm:

$$|\alpha|^2 + |\beta|^2 = 1. \tag{3.3}$$

The coefficients $\alpha$ and $\beta$ are probability amplitudes: they are not probabilities themselves
but allow us to calculate probabilities. The unit-norm constraint results from the Born
rule (the probabilistic interpretation) of the quantum theory, and we speak more on this
constraint and probability amplitudes when we introduce the measurement postulate.
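The unit-norm constraint in (3.3) is easy to check numerically. The sketch below rescales an arbitrary pair of complex numbers so that it satisfies the constraint; the helper name `normalize` and the example amplitudes are hypothetical and serve only as an illustration.

```python
import math

def normalize(alpha: complex, beta: complex):
    """Rescale (alpha, beta) so that |alpha|^2 + |beta|^2 = 1."""
    norm = math.sqrt(abs(alpha) ** 2 + abs(beta) ** 2)
    if norm == 0:
        raise ValueError("the zero vector is not a valid qubit state")
    return alpha / norm, beta / norm

# Example amplitudes chosen arbitrarily for illustration.
alpha, beta = normalize(1 + 1j, 2)
total = abs(alpha) ** 2 + abs(beta) ** 2   # unit-norm check, as in (3.3)
```

The squared magnitudes `abs(alpha) ** 2` and `abs(beta) ** 2` are the quantities that sum to one, which is why the amplitudes can later be tied to probabilities.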

The possibility of superposition states indicates that we cannot represent the states $|0\rangle$
and $|1\rangle$ with the Boolean algebra of the respective classical bits 0 and 1 because Boolean
algebra does not allow for superposition states. We instead require the mathematics of linear
algebra to describe these states. It is beneficial at first to define a vector representation of
the states $|0\rangle$ and $|1\rangle$:

$$|0\rangle = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \tag{3.4}$$

$$|1\rangle = \begin{bmatrix} 0 \\ 1 \end{bmatrix}. \tag{3.5}$$

The $|0\rangle$ and $|1\rangle$ states are called "kets" in the language of the Dirac notation, and it is best
at first to think of them merely as column vectors. The superposition state in (3.2) then has
a representation as the following two-dimensional vector:

$$\alpha|0\rangle + \beta|1\rangle = \begin{bmatrix} \alpha \\ \beta \end{bmatrix}. \tag{3.6}$$

The representation of quantum states with vectors is helpful in understanding some of the
mathematics that underpins the theory, but it turns out to be much more useful for our
purposes to work directly with the Dirac notation. We give the vector representation for
now, but later on, we will only employ the Dirac notation.

¹ We will not present Schrödinger's equation in this book, but instead focus on a "quantum information"
presentation of the quantum theory. Griffiths' book on quantum mechanics introduces the quantum theory
from the Schrödinger equation if you are interested [116].



The Bloch sphere, depicted in Figure 3.2, gives a valuable way to visualize a qubit.
Consider any two qubits that are equivalent up to a differing global phase. For example,
these two qubits could be

$$|\psi_0\rangle = |\psi\rangle, \qquad |\psi_1\rangle = e^{i\chi}|\psi\rangle, \tag{3.7}$$

where $0 \le \chi < 2\pi$. There is a sense in which these two qubits are physically equivalent
because they give the same physical results when we measure them (more on this point
when we introduce the measurement postulate in Section 3.4). Suppose that the probability
amplitudes $\alpha$ and $\beta$ have the following respective representations as complex numbers:

$$\alpha = r_0 e^{i\varphi_0}, \tag{3.8}$$

$$\beta = r_1 e^{i\varphi_1}. \tag{3.9}$$

We can factor out the phase $e^{i\varphi_0}$ from both coefficients $\alpha$ and $\beta$, and we still have a state
that is physically equivalent to the state in (3.2):

$$|\psi\rangle = r_0|0\rangle + r_1 e^{i(\varphi_1 - \varphi_0)}|1\rangle, \tag{3.10}$$

where we redefine $|\psi\rangle$ to represent the state because of the equivalence mentioned in (3.7).
Let $\varphi = \varphi_1 - \varphi_0$, where $0 \le \varphi < 2\pi$. Recall that the unit-norm constraint requires
$|r_0|^2 + |r_1|^2 = 1$. We can thus parametrize the values of $r_0$ and $r_1$ in terms of one parameter $\theta$ so
that

$$r_0 = \cos(\theta/2), \tag{3.11}$$

$$r_1 = \sin(\theta/2). \tag{3.12}$$

The parameter $\theta$ varies between 0 and $\pi$. This range of $\theta$ and the factor of two give a unique
representation of the qubit. One may think to have $\theta$ vary between 0 and $2\pi$ and omit the
factor of two, but this parametrization would not uniquely characterize the qubit in terms
of the parameters $\theta$ and $\varphi$. The parametrization in terms of $\theta$ and $\varphi$ gives the Bloch sphere
representation of the qubit in (3.2):

$$\cos(\theta/2)|0\rangle + \sin(\theta/2) e^{i\varphi}|1\rangle. \tag{3.13}$$

We can plot the state of any qubit on a unit sphere, called the Bloch sphere. Figure 3.2
depicts this representation of a qubit.

Figure 3.2: The Bloch sphere representation of a qubit. Any qubit $|\psi\rangle$ admits a representation in terms
of two angles $\theta$ and $\varphi$ where $0 \le \theta \le \pi$ and $0 \le \varphi < 2\pi$. The state of any qubit in terms of these angles is
$|\psi\rangle = \cos(\theta/2)|0\rangle + e^{i\varphi}\sin(\theta/2)|1\rangle$.
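The steps from (3.8) to (3.13) can be sketched as a short routine that recovers the Bloch sphere angles from a pair of amplitudes. The function name `bloch_angles` and the example state are hypothetical, and the code assumes that the amplitudes already satisfy the unit-norm constraint.

```python
import cmath
import math

def bloch_angles(alpha: complex, beta: complex):
    """Return (theta, phi) such that the state equals, up to a global phase,
    cos(theta/2)|0> + exp(i*phi) sin(theta/2)|1>.

    Assumes |alpha|^2 + |beta|^2 = 1.
    """
    r0, r1 = abs(alpha), abs(beta)
    theta = 2 * math.atan2(r1, r0)   # r0 = cos(theta/2), r1 = sin(theta/2)
    if r0 == 0 or r1 == 0:
        phi = 0.0                    # the relative phase is irrelevant at the poles
    else:
        # Factoring out the phase of alpha leaves only the relative phase.
        phi = (cmath.phase(beta) - cmath.phase(alpha)) % (2 * math.pi)
    return theta, phi

# Example state with global phase chi, theta = 0.8 and phi = 1.1.
chi = 0.3
alpha = cmath.exp(1j * chi) * math.cos(0.4)
beta = cmath.exp(1j * (chi + 1.1)) * math.sin(0.4)
theta, phi = bloch_angles(alpha, beta)
```

Changing `chi` leaves the returned angles unchanged up to rounding, which reflects the physical equivalence of states that differ only by a global phase.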

In linear algebra, column vectors are not the only type of vectors; row vectors are useful
as well. Is there an equivalent of a row vector in Dirac notation? The Dirac notation provides
an entity called a "bra," that has a representation as a row vector. The bras corresponding
to the kets $|0\rangle$ and $|1\rangle$ are as follows:

$$\langle 0| = \begin{bmatrix} 1 & 0 \end{bmatrix}, \tag{3.14}$$

$$\langle 1| = \begin{bmatrix} 0 & 1 \end{bmatrix}, \tag{3.15}$$

and are the matrix conjugate transpose of the kets $|0\rangle$ and $|1\rangle$:

$$\langle 0| = (|0\rangle)^\dagger, \tag{3.16}$$

$$\langle 1| = (|1\rangle)^\dagger. \tag{3.17}$$

We require the conjugate transpose operation (as opposed to just the transpose) because the
mathematical representation of a general quantum state can have complex entries.






The bras do not represent quantum states, but are helpful in calculating probability
amplitudes. For our example qubit in (3.2), suppose that we would like to determine the
probability amplitude that the state is $|0\rangle$. We can combine the state in (3.2) with the bra
$\langle 0|$ as follows:

\begin{align}
\langle 0||\psi\rangle &= \langle 0|(\alpha|0\rangle + \beta|1\rangle) \tag{3.18} \\
&= \alpha\langle 0||0\rangle + \beta\langle 0||1\rangle \tag{3.19} \\
&= \alpha \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} + \beta \begin{bmatrix} 1 & 0 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \end{bmatrix} \tag{3.20} \\
&= \alpha \cdot 1 + \beta \cdot 0 \tag{3.21} \\
&= \alpha. \tag{3.22}
\end{align}

The above calculation may seem as if it is merely an exercise in linear algebra, with a
"glorified" Dirac notation, but it is a standard calculation in the quantum theory. A quantity
like $\langle 0||\psi\rangle$ occurs so often in the quantum theory that we abbreviate it as

$$\langle 0|\psi\rangle = \langle 0||\psi\rangle, \tag{3.23}$$

and the above notation is known as a "braket."² The physical interpretation of the quantity
$\langle 0|\psi\rangle$ is that it is the probability amplitude for being in the state $|0\rangle$, and likewise, the
quantity $\langle 1|\psi\rangle$ is the probability amplitude for being in the state $|1\rangle$. We can also determine
that the amplitude $\langle 1|0\rangle$ (for the state $|0\rangle$ to be in the state $|1\rangle$) and the amplitude $\langle 0|1\rangle$
are both equal to zero. These two states are orthogonal states because they have no overlap.
The amplitudes $\langle 0|0\rangle$ and $\langle 1|1\rangle$ are both equal to one by following a similar calculation.

Our next task may seem like a frivolous exercise, but we would like to determine the
amplitude for any state $|\psi\rangle$ to be in the state $|\psi\rangle$, i.e., to be itself. Following the above
method, this amplitude is $\langle\psi|\psi\rangle$ and we calculate it as

\begin{align}
\langle\psi|\psi\rangle &= (\langle 0|\alpha^* + \langle 1|\beta^*)(\alpha|0\rangle + \beta|1\rangle) \tag{3.24} \\
&= \alpha^*\alpha\langle 0|0\rangle + \beta^*\alpha\langle 1|0\rangle + \alpha^*\beta\langle 0|1\rangle + \beta^*\beta\langle 1|1\rangle \tag{3.25} \\
&= |\alpha|^2 + |\beta|^2 \tag{3.26} \\
&= 1, \tag{3.27}
\end{align}

where we have used the orthogonality relations of $\langle 0|0\rangle$, $\langle 1|0\rangle$, $\langle 0|1\rangle$, and $\langle 1|1\rangle$, and the
unit-norm constraint. We come back to the unit-norm constraint in our discussion of quantum
measurement, but for now, we have shown that any quantum state has a unit amplitude for
being itself.
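The amplitude calculations above translate directly into a few lines of code. This is a minimal sketch in which kets are stored as lists of complex amplitudes; the helper names `bra` and `braket` and the example amplitudes are assumptions made for illustration.

```python
def bra(ket):
    """Row vector dual to a ket: the conjugate transpose of its amplitudes."""
    return [complex(a).conjugate() for a in ket]

def braket(phi, psi):
    """Amplitude <phi|psi> for the state psi to be in the state phi."""
    return sum(b * a for b, a in zip(bra(phi), psi))

ket0, ket1 = [1, 0], [0, 1]     # computational basis vectors
alpha, beta = 0.6, 0.8j         # example amplitudes with unit norm
psi = [alpha, beta]

amp0 = braket(ket0, psi)        # equals alpha
amp1 = braket(ket1, psi)        # equals beta
overlap = braket(ket0, ket1)    # zero, since the states are orthogonal
self_amp = braket(psi, psi)     # |alpha|^2 + |beta|^2
```

The complex conjugation inside `bra` is what makes `self_amp` equal to $|\alpha|^2 + |\beta|^2$; a plain transpose would give $\alpha^2 + \beta^2$ instead.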

The states $|0\rangle$ and $|1\rangle$ are a particular basis for a qubit that we call the computational
basis. The computational basis is the standard basis that we employ in quantum computation
and communication, but other bases are important as well. Consider that the following two
vectors form an orthonormal basis:

$$\frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \end{bmatrix}. \tag{3.28}$$

The above alternate basis is so important in quantum information theory that we define a
Dirac notation shorthand for it, and we can also define the basis in terms of the computational
basis:

$$|+\rangle = \frac{|0\rangle + |1\rangle}{\sqrt{2}}, \tag{3.29}$$

$$|-\rangle = \frac{|0\rangle - |1\rangle}{\sqrt{2}}. \tag{3.30}$$

The common names for this alternate basis are the "+/-" basis, the Hadamard basis, or
the diagonal basis. It is preferable for us to use the Dirac notation, but we are using the
vector representation as an aid for now.

² It is for this (silly) reason that Dirac decided to use the names "bra" and "ket," because putting them
together gives a "braket." The names in the notation may be silly, but the notation itself has persisted over
time because this way of representing quantum states turns out to be useful. We will avoid the use of the
terms "bra" and "ket" as much as we can, only resorting to these terms if necessary.

Exercise 3.2.1 Determine the Bloch sphere angles $\theta$ and $\varphi$ for the states $|+\rangle$ and $|-\rangle$.



What is the amplitude that the state in (3.2) is in the state $|+\rangle$? What is the amplitude
that it is in the state $|-\rangle$? These are questions to which the quantum theory provides simple
answers. We employ the bra $\langle +|$ and calculate the amplitude $\langle +|\psi\rangle$ as

\begin{align}
\langle +|\psi\rangle &= \langle +|(\alpha|0\rangle + \beta|1\rangle) \tag{3.31} \\
&= \alpha\langle +|0\rangle + \beta\langle +|1\rangle \tag{3.32} \\
&= \frac{\alpha + \beta}{\sqrt{2}}. \tag{3.33}
\end{align}

The result follows by employing the definition in (3.29) and doing similar linear algebraic
calculations as the example in (3.22). We can also calculate the amplitude $\langle -|\psi\rangle$ as

$$\langle -|\psi\rangle = \frac{\alpha - \beta}{\sqrt{2}}. \tag{3.34}$$

The above calculation follows from similar manipulations.
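A quick numerical check of (3.33) and (3.34), using column vectors for the $|+\rangle$ and $|-\rangle$ states. As before, the list-based representation, the helper `braket`, and the example amplitudes are illustrative assumptions.

```python
import math

def braket(phi, psi):
    """Amplitude <phi|psi>, with kets stored as lists of complex amplitudes."""
    return sum(complex(b).conjugate() * a for b, a in zip(phi, psi))

s = 1 / math.sqrt(2)
plus = [s, s]                   # (|0> + |1>) / sqrt(2)
minus = [s, -s]                 # (|0> - |1>) / sqrt(2)

alpha, beta = 0.6, 0.8j         # example amplitudes with unit norm
psi = [alpha, beta]

amp_plus = braket(plus, psi)    # (alpha + beta) / sqrt(2)
amp_minus = braket(minus, psi)  # (alpha - beta) / sqrt(2)
```

The squared magnitudes of `amp_plus` and `amp_minus` again sum to one, as expected for amplitudes taken with respect to an orthonormal basis.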

The +/- basis is a complete orthonormal basis, meaning that we can represent any qubit
state in terms of the two basis states $|+\rangle$ and $|-\rangle$. Indeed, the above probability amplitude
calculations suggest that we can represent the qubit in (3.2) as the following superposition
state:

$$\left(\frac{\alpha + \beta}{\sqrt{2}}\right)|+\rangle + \left(\frac{\alpha - \beta}{\sqrt{2}}\right)|-\rangle. \tag{3.35}$$

The above representation is an alternate one if we would like to "see" the qubit state
represented in the +/- basis. We can substitute the equivalences in (3.33) and (3.34) to represent
the state $|\psi\rangle$ as

$$|\psi\rangle = \langle +|\psi\rangle|+\rangle + \langle -|\psi\rangle|-\rangle. \tag{3.36}$$






The amplitudes $\langle +|\psi\rangle$ and $\langle -|\psi\rangle$ are both scalar quantities so that the above quantity is
equivalent to the following one:

$$|\psi\rangle = |+\rangle\langle +|\psi\rangle + |-\rangle\langle -|\psi\rangle. \tag{3.37}$$

The order of the multiplication in the terms $|+\rangle\langle +|\psi\rangle$ and $|-\rangle\langle -|\psi\rangle$ does not matter, i.e.,
the following equivalence holds

$$|+\rangle(\langle +|\psi\rangle) = (|+\rangle\langle +|)|\psi\rangle, \tag{3.38}$$

and the same for $|-\rangle\langle -|\psi\rangle$. The quantity on the left is a ket multiplied by an amplitude,
whereas the quantity on the right is a linear operator multiplying a ket, but linear algebra
tells us that these two quantities are equivalent. The operators $|+\rangle\langle +|$ and $|-\rangle\langle -|$ are
special operators: they are rank-one projection operators, meaning that they project onto
a one-dimensional subspace. Using linearity, we have the following equivalence:

$$|\psi\rangle = (|+\rangle\langle +| + |-\rangle\langle -|)|\psi\rangle. \tag{3.39}$$

The above equation indicates a seemingly trivial, but important point: the operator
$|+\rangle\langle +| + |-\rangle\langle -|$ is equivalent to the identity operator and we can write

$$I = |+\rangle\langle +| + |-\rangle\langle -|, \tag{3.40}$$

where $I$ stands for the identity operator. This relation is known as the completeness relation
or the resolution of the identity. Given any orthonormal basis, we can always construct a
resolution of the identity by summing over the rank-one projection operators formed from
each of the orthonormal basis states. For example, the computational basis states give
another way to form a resolution of the identity operator:

$$I = |0\rangle\langle 0| + |1\rangle\langle 1|. \tag{3.41}$$

This simple trick provides a way to find the representation of a quantum state in any basis.
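The completeness relation in (3.40) can be verified numerically by forming the two rank-one projectors as outer products and adding them. In this sketch, matrices are nested lists and the helpers `outer`, `add`, and `apply` are hypothetical names introduced only for the illustration.

```python
import math

def outer(ket):
    """|ket><ket| as a nested list: entry [r][c] = ket[r] * conj(ket[c])."""
    return [[a * complex(b).conjugate() for b in ket] for a in ket]

def add(m1, m2):
    """Entrywise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def apply(matrix, ket):
    """Matrix-vector product: the operator acting on the ket."""
    return [sum(m * a for m, a in zip(row, ket)) for row in matrix]

s = 1 / math.sqrt(2)
plus, minus = [s, s], [s, -s]

resolution = add(outer(plus), outer(minus))   # |+><+| + |-><-|
psi = [0.6, 0.8j]                             # example state
psi_again = apply(resolution, psi)            # should reproduce psi
```

Applying `outer(plus)` alone to `psi` returns the component of the state along $|+\rangle$, which is the projection described in (3.38).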



3.3 Reversible Evolution 

Physical systems evolve as time progresses. The application of a magnetic field to an electron 
can change its spin and pulsing an atom with a laser can excite one of its electrons from a 
ground state to an excited state. These are only a couple of ways in which physical systems 
can change. 

The Schrodinger equation governs the evolution of a closed quantum system. In this 
book, we will not even state the Schrodinger equation, but we will instead focus on its major 
result. The evolution of a closed quantum system is reversible if we do not learn anything 
about the state of the system. Reversibility implies that we can determine the input state 
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Figure 3.3: The above figure is a quantum circuit diagram that depicts the evolution of a quantum state 
\ip) according to a unitary operator U . 



of an evolution given the output state and knowledge of the evolution. An example of a 
single-qubit reversible operation is a NOT gate: 



$$|0\rangle \to |1\rangle, \qquad |1\rangle \to |0\rangle. \tag{3.42}$$



In the classical world, we would say that the NOT gate merely flips the value of the input 
classical bit. In the quantum world, the NOT gate flips the basis states |0) and |1). The 
NOT gate is reversible because we can simply apply the NOT gate again to recover the 
original input state — the NOT gate is its own inverse. 

In general, a closed quantum system evolves according to a unitary operator $U$. Unitary
evolution implies reversibility because a unitary operator always possesses an inverse: its
inverse is merely $U^\dagger$. This property gives the relations

$$U^\dagger U = U U^\dagger = I. \tag{3.43}$$



The unitary property also ensures that evolution preserves the unit-norm constraint (an 
important requirement for a physical state that we discuss in the section on measurement). 



Consider applying the unitary operator $U$ to the example qubit state in (3.2):

$$U|\psi\rangle. \tag{3.44}$$

Figure 3.3 depicts a quantum circuit diagram for unitary evolution.

The bra that is dual to the above state is $\langle\psi|U^\dagger$ (we again apply the conjugate transpose
operation to get the bra). We showed in (3.24)-(3.27) that every quantum state should have a
unit amplitude for being itself. This relation holds for the state $U|\psi\rangle$ because the operator
$U$ is unitary:

$$\langle\psi|U^\dagger U|\psi\rangle = \langle\psi|I|\psi\rangle = \langle\psi|\psi\rangle = 1. \tag{3.45}$$

The assumption that a vector always has a unit amplitude for being itself is one of the 
crucial assumptions of the quantum theory, and the above reasoning demonstrates that 
unitary evolution complements this assumption. 
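The following sketch checks these properties numerically for one example unitary. The rotation matrix `U`, the angle 0.3, and the example state are assumptions chosen only for illustration; any 2x2 unitary would serve.

```python
import math

def dagger(m):
    """Conjugate transpose of a 2x2 matrix given as nested lists."""
    return [[complex(m[j][i]).conjugate() for j in range(2)] for i in range(2)]

def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def apply(m, ket):
    """Action of a 2x2 matrix on a two-component ket."""
    return [sum(m[i][k] * ket[k] for k in range(2)) for i in range(2)]

def norm_sq(ket):
    """Squared norm <ket|ket>."""
    return sum(abs(c) ** 2 for c in ket)

theta = 0.3
U = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]   # example unitary (a real rotation)
psi = [0.6, 0.8j]                           # example unit-norm state

gram = matmul(dagger(U), U)          # should equal the identity
out = apply(U, psi)                  # evolved state U|psi>
recovered = apply(dagger(U), out)    # applying the inverse undoes the evolution
```

Here `gram` is the identity matrix, `norm_sq(out)` stays equal to one, and `recovered` matches `psi`, all up to rounding error.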



3.3.1 Matrix Representations of Operators 

We now explore some properties of the NOT gate. Let X denote the operator corresponding 
to a NOT gate. The action of $X$ on the computational basis states is as follows:

$$X|i\rangle = |i \oplus 1\rangle, \tag{3.46}$$

where $i \in \{0, 1\}$ and $\oplus$ denotes binary addition. Suppose the NOT gate acts on a
superposition state:

$$X\left(\alpha|0\rangle + \beta|1\rangle\right). \tag{3.47}$$

By the linearity of the quantum theory, the $X$ operator distributes so that the above
expression is equal to the following one:

$$\alpha X|0\rangle + \beta X|1\rangle = \alpha|1\rangle + \beta|0\rangle. \tag{3.48}$$



Indeed, the NOT gate X merely flips the basis states of any quantum state when represented 
in the computational basis. 

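We can determine a matrix representation for the operator $X$ by using the bras $\langle 0|$ and
$\langle 1|$. Consider the relations in (3.46). Let us combine the relations with the bra $\langle 0|$:

\begin{align}
\langle 0|X|0\rangle &= \langle 0|1\rangle = 0, \tag{3.49}\\
\langle 0|X|1\rangle &= \langle 0|0\rangle = 1. \tag{3.50}
\end{align}

Likewise, we can combine with the bra $\langle 1|$:

\begin{align}
\langle 1|X|0\rangle &= \langle 1|1\rangle = 1, \tag{3.51}\\
\langle 1|X|1\rangle &= \langle 1|0\rangle = 0. \tag{3.52}
\end{align}

We can place these entries in a matrix to give a matrix representation of the operator $X$:

$$\begin{bmatrix} \langle 0|X|0\rangle & \langle 0|X|1\rangle \\ \langle 1|X|0\rangle & \langle 1|X|1\rangle \end{bmatrix}, \tag{3.53}$$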



where we order the rows according to the bras and order the columns according to the kets. 
We then say that

$$X = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \tag{3.54}$$



and adopt the convention that the symbol X refers to both the operator X and its matrix 
representation (this is an abuse of notation, but it should be clear from context when X 
refers to an operator and when it refers to the matrix representation of the operator). 
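The procedure above, sandwiching the operator between each bra and each ket, can be sketched in a few lines. The function names `ket`, `not_gate`, and `inner` are hypothetical helpers introduced only for this example.

```python
def ket(i):
    """Computational basis ket |i> as a two-component list."""
    return [1 if k == i else 0 for k in range(2)]

def not_gate(i):
    """Action of X on a basis label: |i> -> |i XOR 1>, as in (3.46)."""
    return ket(i ^ 1)

def inner(bra, ket_):
    """Amplitude <bra|ket>; the entries here are real, so no conjugation is needed."""
    return sum(b * k for b, k in zip(bra, ket_))

# Rows are ordered by the bras and columns by the kets, as in the text.
X = [[inner(ket(r), not_gate(c)) for c in range(2)] for r in range(2)]
```

Evaluating `X` gives the nested list `[[0, 1], [1, 0]]`, which agrees with the matrix representation of the NOT gate.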

Let us now observe some uniquely quantum behavior. We would like to consider the
action of the NOT operator $X$ on the $+/-$ basis. First, let us consider what happens if we
operate on the $|+\rangle$ state with the $X$ operator. Recall that the state $|+\rangle = (|0\rangle + |1\rangle)/\sqrt{2}$,
so that

\begin{align}
X|+\rangle &= X\left(\frac{|0\rangle + |1\rangle}{\sqrt{2}}\right) \tag{3.55}\\
&= \frac{X|0\rangle + X|1\rangle}{\sqrt{2}} \tag{3.56}\\
&= \frac{|1\rangle + |0\rangle}{\sqrt{2}} \tag{3.57}\\
&= |+\rangle. \tag{3.58}
\end{align}

The above development shows that the state $|+\rangle$ is a special state with respect to the NOT
operator $X$: it is an eigenstate of $X$ with eigenvalue one. An eigenstate of an operator is one
that is invariant under the action of the operator, up to a scalar coefficient. The coefficient in front of the eigenstate
is the eigenvalue corresponding to the eigenstate. Under a unitary evolution, the coefficient
in front of the eigenstate is just a complex phase, but this global phase has no effect on
the observations resulting from a measurement of the state because two quantum states are
equivalent up to a differing global phase.

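Now, let us consider the action of the NOT operator $X$ on the state $|-\rangle$. Recall that
$|-\rangle = (|0\rangle - |1\rangle)/\sqrt{2}$. Calculating similarly, we get that

\begin{align}
X|-\rangle &= X\left(\frac{|0\rangle - |1\rangle}{\sqrt{2}}\right) \tag{3.59}\\
&= \frac{X|0\rangle - X|1\rangle}{\sqrt{2}} \tag{3.60}\\
&= \frac{|1\rangle - |0\rangle}{\sqrt{2}} \tag{3.61}\\
&= -|-\rangle. \tag{3.62}
\end{align}

So the state $|-\rangle$ is also an eigenstate of the operator $X$, but its eigenvalue is $-1$.

We can find a matrix representation of the $X$ operator in the $+/-$ basis as well:

$$\begin{bmatrix} \langle +|X|+\rangle & \langle +|X|-\rangle \\ \langle -|X|+\rangle & \langle -|X|-\rangle \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}. \tag{3.63}$$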



This representation demonstrates that the $X$ operator is diagonal with respect to the $+/-$
basis, and therefore, the $+/-$ basis is an eigenbasis for the $X$ operator. It is always handy
to know the eigenbasis of a unitary operator $U$ because this eigenbasis gives the states that
are invariant under an evolution according to $U$.
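A quick numerical check of this change of representation is sketched below: it computes the four matrix elements of $X$ between the $+/-$ basis states. The helper names `apply` and `inner` are assumptions made for the example.

```python
import math

s = 1 / math.sqrt(2)
plus, minus = [s, s], [s, -s]     # |+> and |-> in the computational basis
X = [[0, 1], [1, 0]]              # NOT gate in the computational basis

def apply(m, ket):
    """Action of a 2x2 matrix on a two-component ket."""
    return [sum(m[i][k] * ket[k] for k in range(2)) for i in range(2)]

def inner(bra, ket):
    """Amplitude <bra|ket> for real-valued vectors."""
    return sum(b * k for b, k in zip(bra, ket))

basis = [plus, minus]
# Rows follow the bras <+|, <-| and columns follow the kets |+>, |->.
X_pm = [[inner(b, apply(X, k)) for k in basis] for b in basis]
```

Up to rounding, `X_pm` is the diagonal matrix with entries $1$ and $-1$, so the off-diagonal amplitudes vanish and the diagonal carries the two eigenvalues.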

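Let $Z$ denote the operator that flips states in the $+/-$ basis:

$$Z|+\rangle \to |-\rangle, \qquad Z|-\rangle \to |+\rangle. \tag{3.64}$$

Using an analysis similar to that which we did for the $X$ operator, we can find a matrix
representation of the $Z$ operator in the $+/-$ basis:

$$\begin{bmatrix} \langle +|Z|+\rangle & \langle +|Z|-\rangle \\ \langle -|Z|+\rangle & \langle -|Z|-\rangle \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}. \tag{3.65}$$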



Interestingly, the matrix representation for the $Z$ operator in the $+/-$ basis is the same as
that for the $X$ operator in the computational basis. For this reason, we call the $Z$ operator
the phase flip operator.³

³ A more appropriate name might be the "bit flip in the $+/-$ basis operator," but this name is too long,
so we stick with the term "phase flip."

We expect the following steps to hold because the quantum theory is a linear theory:

\begin{align}
Z\left(\frac{|+\rangle + |-\rangle}{\sqrt{2}}\right) &= \frac{Z|+\rangle + Z|-\rangle}{\sqrt{2}} = \frac{|-\rangle + |+\rangle}{\sqrt{2}} = \frac{|+\rangle + |-\rangle}{\sqrt{2}}, \tag{3.66}\\
Z\left(\frac{|+\rangle - |-\rangle}{\sqrt{2}}\right) &= \frac{Z|+\rangle - Z|-\rangle}{\sqrt{2}} = \frac{|-\rangle - |+\rangle}{\sqrt{2}} = -\left(\frac{|+\rangle - |-\rangle}{\sqrt{2}}\right). \tag{3.67}
\end{align}



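The above steps demonstrate that the states $(|+\rangle + |-\rangle)/\sqrt{2}$ and $(|+\rangle - |-\rangle)/\sqrt{2}$ are both
eigenstates of the $Z$ operator. These states are none other than the respective computational
basis states $|0\rangle$ and $|1\rangle$, as we see by inspecting the definitions in (3.29)-(3.30) of the $+/-$ basis. Thus,
a matrix representation of the $Z$ operator in the computational basis is

$$\begin{bmatrix} \langle 0|Z|0\rangle & \langle 0|Z|1\rangle \\ \langle 1|Z|0\rangle & \langle 1|Z|1\rangle \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix} \tag{3.68}$$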



and is a diagonalization of the operator Z. So, the behavior of the Z operator in the 
computational basis is the same as the behavior of the X operator in the +/— basis. 

3.3.2 Commutators and Anticommutators 

The commutator $[A, B]$ of two operators $A$ and $B$ is as follows:

$$[A, B] = AB - BA. \tag{3.69}$$

Two operators commute if and only if their commutator is equal to zero.

The anticommutator $\{A, B\}$ of two operators $A$ and $B$ is as follows:

$$\{A, B\} = AB + BA. \tag{3.70}$$

We say that two operators anticommute if their anticommutator is equal to zero.

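To make the two definitions concrete, here is a minimal sketch that computes a commutator and an anticommutator for 2x2 matrices. The matrices `A` and `B` are arbitrary examples chosen for illustration and do not appear in the text; the pair $X$ and $Z$ is deliberately left for the exercise that follows.

```python
def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def commutator(a, b):
    """[A, B] = AB - BA."""
    ab, ba = matmul(a, b), matmul(b, a)
    return [[ab[i][j] - ba[i][j] for j in range(2)] for i in range(2)]

def anticommutator(a, b):
    """{A, B} = AB + BA."""
    ab, ba = matmul(a, b), matmul(b, a)
    return [[ab[i][j] + ba[i][j] for j in range(2)] for i in range(2)]

A = [[0, 1], [0, 0]]      # example matrix (assumption, for illustration only)
B = [[0, 0], [1, 0]]      # example matrix (assumption, for illustration only)
I = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]

comm_AB = commutator(A, B)        # nonzero, so A and B do not commute
anti_AB = anticommutator(A, B)    # nonzero, so A and B do not anticommute
comm_XI = commutator(X, I)        # zero, since every operator commutes with I
```

For these examples the commutator of `A` and `B` is `[[1, 0], [0, -1]]`, their anticommutator is the identity, and the commutator of `X` with `I` is the zero matrix.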
Exercise 3.3.1 Find a matrix representation for [X, Z] in the basis {|0), |1)}. 

3.3.3 The Pauli Matrices 

The convention in quantum theory is to take the computational basis as the standard basis
for representing physical qubits. The standard matrix representation for the above two
operators is as follows when we choose the computational basis as the standard basis:

$$X = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \qquad Z = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}. \tag{3.71}$$

The identity operator $I$ has the following representation in any basis:

$$I = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}. \tag{3.72}$$

Another operator, the $Y$ operator, is a useful one to consider as well. The $Y$ operator has
the following matrix representation in the computational basis:

$$Y = \begin{bmatrix} 0 & -i \\ i & 0 \end{bmatrix}. \tag{3.73}$$



It is easy to check that $Y = iXZ$, and for this reason, we can think of the $Y$ operator
as a combined bit and phase flip. The four matrices $I$, $X$, $Y$, and $Z$ are special for the
manipulation of physical qubits and are known as the Pauli matrices.
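For readers who want to see the check of $Y = iXZ$ written out, the computation uses only the matrix representations in the computational basis given above:

```latex
iXZ
= i \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
    \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}
= i \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}
= \begin{bmatrix} 0 & -i \\ i & 0 \end{bmatrix}
= Y.
```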

Exercise 3.3.2 Show that the Pauli matrices are all Hermitian and unitary, that they square to the
identity, and that their eigenvalues are $\pm 1$.

Exercise 3.3.3 Represent the eigenstates of the $Y$ operator in the computational basis.

Exercise 3.3.4 Show that the Pauli matrices either commute or anticommute.

Exercise 3.3.5 Let us label the Pauli matrices as $\sigma_0 \equiv I$, $\sigma_1 \equiv X$, $\sigma_2 \equiv Y$, and $\sigma_3 \equiv Z$.
Show that $\mathrm{Tr}\{\sigma_i \sigma_j\} = 2\delta_{ij}$ for all $i, j \in \{0, \ldots, 3\}$.

3.3.4 Hadamard Gate 

Another important unitary operator is the transformation that takes the computational basis 
to the +/— basis. This transformation is the Hadamard transformation: 

\begin{align}
|0\rangle &\to |+\rangle, \tag{3.74}\\
|1\rangle &\to |-\rangle. \tag{3.75}
\end{align}

Using the above relations, we can represent the Hadamard transformation as the following
operator:

$$H \equiv |+\rangle\langle 0| + |-\rangle\langle 1|. \tag{3.76}$$

It is straightforward to check that the above operator implements the transformation in (3.74)-(3.75).

Now consider a generalization of the above construction. Suppose that one orthonormal
basis is $\{|\psi_i\rangle\}_{i \in \{0,1\}}$ and another is $\{|\phi_i\rangle\}_{i \in \{0,1\}}$, where the index $i$ merely indexes the states
in each orthonormal basis. Then the unitary operator that takes states in the first basis to
states in the second basis is

$$\sum_{i=0,1} |\phi_i\rangle\langle\psi_i|. \tag{3.77}$$
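The construction in (3.77) can be sketched directly as a sum of outer products. In the example below the first basis is the computational basis and the second is the $+/-$ basis, so the result is the Hadamard operator of (3.76); the function names `outer`, `basis_change`, and `apply` are assumptions made for this illustration.

```python
import math

s = 1 / math.sqrt(2)
zero, one = [1, 0], [0, 1]
plus, minus = [s, s], [s, -s]

def outer(ket, bra):
    """Outer product |ket><bra| as a 2x2 nested list."""
    return [[ket[i] * complex(bra[j]).conjugate() for j in range(2)]
            for i in range(2)]

def basis_change(old_basis, new_basis):
    """Operator sum_i |new_i><old_i| taking the old basis to the new one."""
    total = [[0j, 0j], [0j, 0j]]
    for old, new in zip(old_basis, new_basis):
        term = outer(new, old)
        for r in range(2):
            for c in range(2):
                total[r][c] += term[r][c]
    return total

def apply(m, ket):
    """Action of a 2x2 matrix on a two-component ket."""
    return [sum(m[i][k] * ket[k] for k in range(2)) for i in range(2)]

H = basis_change([zero, one], [plus, minus])   # |+><0| + |-><1|
image_of_zero = apply(H, zero)                 # should be |+>
image_of_one = apply(H, one)                   # should be |->
```

Swapping the two arguments of `basis_change` produces the operator that maps the $+/-$ basis back to the computational basis.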

Exercise 3.3.6 Show that the Hadamard operator $H$ has the following matrix representation
in the computational basis:

$$H \equiv \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}. \tag{3.78}$$

Exercise 3.3.7 Show that the Hadamard operator is its own inverse by employing the above
matrix representation and by using its operator form in (3.76).

Figure 3.4: The above figure provides more labels for states on the Bloch sphere. The Z axis has its points
on the sphere as eigenstates of the Pauli $Z$ operator, the X axis has eigenstates of the Pauli $X$ operator,
and the Y axis has eigenstates of the Pauli $Y$ operator. The rotation operators $R_X(\phi)$, $R_Y(\phi)$, and $R_Z(\phi)$
rotate a state on the sphere by an angle $\phi$ about the respective X, Y, and Z axis.



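Exercise 3.3.8 If the Hadamard gate is its own inverse, then it takes the states $|+\rangle$ and
$|-\rangle$ to the respective states $|0\rangle$ and $|1\rangle$ and we can represent it as the following operator:

$$H = |0\rangle\langle +| + |1\rangle\langle -|. \tag{3.79}$$

Show that

$$|0\rangle\langle +| + |1\rangle\langle -| = |+\rangle\langle 0| + |-\rangle\langle 1|. \tag{3.80}$$

Exercise 3.3.9 Show that $HXH = Z$ and that $HZH = X$.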



3.3.5 Rotation Operators 

We end this section on the evolution of quantum states by discussing "rotation evolutions"
and by giving a more complete picture of the Bloch sphere. The rotation operators $R_X(\phi)$,
$R_Y(\phi)$, $R_Z(\phi)$ are functions of the respective Pauli operators $X$, $Y$, $Z$ where

\begin{align}
R_X(\phi) &\equiv \exp\{iX\phi/2\}, \tag{3.81}\\
R_Y(\phi) &\equiv \exp\{iY\phi/2\}, \tag{3.82}\\
R_Z(\phi) &\equiv \exp\{iZ\phi/2\}, \tag{3.83}
\end{align}

and $\phi$ is some angle such that $0 \leq \phi < 2\pi$. How do we determine a function of an operator?
The standard way is to represent the operator in its diagonal basis and apply the function to
the eigenvalues of the operator. For example, the diagonal representation of the $X$ operator
is

$$X = |+\rangle\langle +| - |-\rangle\langle -|. \tag{3.84}$$

Applying the function $\exp\{iX\phi/2\}$ to the eigenvalues of $X$ gives

$$R_X(\phi) = \exp\{i\phi/2\}|+\rangle\langle +| + \exp\{-i\phi/2\}|-\rangle\langle -|. \tag{3.85}$$

More generally, suppose that a Hermitian operator $A$ has the spectral decomposition
$A = \sum_i a_i |i\rangle\langle i|$ for some orthonormal basis $\{|i\rangle\}$. Then the operator $f(A)$ for some function
$f$ is as follows:

$$f(A) = \sum_i f(a_i)|i\rangle\langle i|. \tag{3.86}$$
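The recipe in (3.86) translates into a short routine: apply the function to each eigenvalue and weight the corresponding rank-one projector. The sketch below builds $R_X(\phi)$ this way for one example angle and compares it numerically with the form $\cos(\phi/2) I + i \sin(\phi/2) X$; the function name `function_of_operator` and the angle 0.7 are assumptions for illustration, and a numerical agreement at one angle is of course not a proof.

```python
import cmath
import math

s = 1 / math.sqrt(2)
plus, minus = [s, s], [s, -s]     # eigenstates of X with eigenvalues +1 and -1

def outer(ket, bra):
    """Outer product |ket><bra| as a 2x2 nested list."""
    return [[ket[i] * complex(bra[j]).conjugate() for j in range(2)]
            for i in range(2)]

def function_of_operator(f, eigenpairs):
    """f(A) = sum_i f(a_i)|i><i| for eigenpairs [(a_i, |i>), ...]."""
    result = [[0j, 0j], [0j, 0j]]
    for eigenvalue, ket in eigenpairs:
        projector = outer(ket, ket)
        for r in range(2):
            for c in range(2):
                result[r][c] += f(eigenvalue) * projector[r][c]
    return result

phi = 0.7                                           # example rotation angle
x_eigenpairs = [(1, plus), (-1, minus)]
R_X = function_of_operator(lambda a: cmath.exp(1j * a * phi / 2), x_eigenpairs)
X_again = function_of_operator(lambda a: a, x_eigenpairs)   # identity function returns X

closed_form = [[math.cos(phi / 2), 1j * math.sin(phi / 2)],
               [1j * math.sin(phi / 2), math.cos(phi / 2)]]
```

Passing the identity function recovers $X$ itself, which is a useful sanity check on the eigenpairs before applying the exponential.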

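Exercise 3.3.10 Show that the rotation operators $R_X(\phi)$, $R_Y(\phi)$, $R_Z(\phi)$ are equivalent to
the following expressions:

\begin{align}
R_X(\phi) &= \cos(\phi/2) I + i \sin(\phi/2) X, \tag{3.87}\\
R_Y(\phi) &= \cos(\phi/2) I + i \sin(\phi/2) Y, \tag{3.88}\\
R_Z(\phi) &= \cos(\phi/2) I + i \sin(\phi/2) Z, \tag{3.89}
\end{align}

by using the facts that

\begin{align}
\cos(\phi/2) &= \frac{1}{2}\left(e^{i\phi/2} + e^{-i\phi/2}\right), \tag{3.90}\\
\sin(\phi/2) &= \frac{1}{2i}\left(e^{i\phi/2} - e^{-i\phi/2}\right). \tag{3.91}
\end{align}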



Figure 3.4 provides a more detailed picture of the Bloch sphere since we have now
established the Pauli operators and their eigenstates. The computational basis states are the
eigenstates of the $Z$ operator and are the north and south poles on the Bloch sphere. The
$+/-$ basis states are the eigenstates of the $X$ operator and the calculation from Exercise 3.2.1
shows that they are the "east and west poles" of the Bloch sphere. We leave it as another
exercise to show that the $Y$ eigenstates are the other poles along the equator of the Bloch
sphere.

Exercise 3.3.11 Determine the Bloch sphere angles $\theta$ and $\phi$ for the eigenstates of the Pauli
$Y$ operator.



3.4 Measurement 



Measurement is another type of evolution that a quantum system can undergo, and this
type of evolution is unique to the quantum theory. It is an evolution that allows us to
retrieve classical information from a quantum state and thus is the way that we can "read
out" information. Suppose that we would like to learn something about the quantum state
$|\psi\rangle$ in (3.2). Nature prevents us from learning anything about the probability amplitudes
$\alpha$ and $\beta$ if we have only one quantum measurement that we can perform. Nature only
allows us to measure observables. Observables are physical variables such as the position
or momentum of a particle. In the quantum theory, we represent observables as Hermitian
operators because their eigenvalues are real numbers and every measuring device outputs a
real number. Examples of qubit observables that we can measure are the Pauli operators $X$,
$Y$, and $Z$.

Figure 3.5: The above figure depicts our diagram of a quantum measurement. Thin lines denote quantum
information and thick lines denote classical information. The result of the measurement is to output a
classical variable $m$ according to a probability distribution governed by the Born rule of the quantum theory.

Suppose we measure the $Z$ operator. This measurement is called a "measurement in the
computational basis" or a "measurement of the $Z$ observable" because we are measuring the
eigenvalues of the $Z$ operator. The measurement postulate of the quantum theory, also known
as the Born rule, states that the system "collapses" into the state $|0\rangle$ with probability $|\alpha|^2$
and collapses into the state $|1\rangle$ with probability $|\beta|^2$. That is, the resulting probabilities are
the squared magnitudes of the probability amplitudes. After the measurement, our measuring apparatus
tells us whether the state collapsed into $|0\rangle$ or $|1\rangle$: it returns $+1$ if the resulting state is $|0\rangle$
and returns $-1$ if the resulting state is $|1\rangle$. These returned values are the eigenvalues of the
$Z$ operator. The measurement postulate is the aspect of the quantum theory that makes it
probabilistic or "jumpy" and is part of the "strangeness" of the quantum theory. Figure 3.5
depicts the notation for a measurement that we will use in diagrams throughout this book.

What is the result if we measure the state $|\psi\rangle$ in the $+/-$ basis? Consider that we
can represent $|\psi\rangle$ as a superposition of the $|+\rangle$ and $|-\rangle$ states, as given in (3.35). The
measurement postulate then states that a measurement of the $X$ operator gives the state
$|+\rangle$ with probability $|\alpha + \beta|^2/2$ and the state $|-\rangle$ with probability $|\alpha - \beta|^2/2$. Quantum
interference is now playing a role because the amplitudes $\alpha$ and $\beta$ interfere with each other.
This effect plays an important role in quantum information theory.

In some cases, the basis states $|0\rangle$ and $|1\rangle$ may not represent the spin states of an electron,
but may represent the location of an electron. So, a way to interpret this measurement
postulate is that the electron "jumps into" one location or another depending on the outcome
of the measurement. But what is the state of the electron before the measurement? We will
just say in this book that it is in a superposed, indefinite, or unsharp state, rather than
trying to pin down a philosophical interpretation. Some might say that the electron is in
"two different locations at the same time."

Quantum State               Probability of $|+\rangle$    Probability of $|-\rangle$
Superposition state         $|\alpha + \beta|^2/2$        $|\alpha - \beta|^2/2$
Probabilistic description   $1/2$                         $1/2$

Table 3.1: The above table summarizes the differences in probabilities for a quantum state in a superposition
$\alpha|0\rangle + \beta|1\rangle$ and a classical state that is a probabilistic mixture of $|0\rangle$ and $|1\rangle$.

Also, we should stress that we cannot interpret this measurement postulate as meaning
that the state is in $|0\rangle$ or $|1\rangle$ with respective probabilities $|\alpha|^2$ and $|\beta|^2$ before the measurement
occurs, because this latter scenario is completely classical. The superposition state $\alpha|0\rangle + \beta|1\rangle$
gives fundamentally different behavior from the probabilistic description of a state that is in
$|0\rangle$ or $|1\rangle$ with respective probabilities $|\alpha|^2$ and $|\beta|^2$. Suppose that we have the two different
descriptions of a state (superposition and probabilistic) and measure the $Z$ operator. We get
the same result for both cases: the resulting state is $|0\rangle$ or $|1\rangle$ with respective probabilities
$|\alpha|^2$ and $|\beta|^2$.

But now suppose that we measure the $X$ operator. The superposed state gives the
result from before: we get the state $|+\rangle$ with probability $|\alpha + \beta|^2/2$ and the state $|-\rangle$ with
probability $|\alpha - \beta|^2/2$. The probabilistic description gives a much different result. Suppose
that the state is $|0\rangle$. We know that $|0\rangle$ is a uniform superposition of $|+\rangle$ and $|-\rangle$:

$$|0\rangle = \frac{|+\rangle + |-\rangle}{\sqrt{2}}. \tag{3.92}$$

So the state collapses to $|+\rangle$ or $|-\rangle$ with equal probability in this case. If the state is $|1\rangle$,
then it collapses again to $|+\rangle$ or $|-\rangle$ with equal probabilities. Summing up these
probabilities, it follows that a measurement of the $X$ operator gives the state $|+\rangle$ with probability
$(|\alpha|^2 + |\beta|^2)/2 = 1/2$ and gives the state $|-\rangle$ with the same probability. These results are
fundamentally different from those where the state is the superposition state $|\psi\rangle$, and
experiment after experiment supports the predictions of the quantum theory. Table 3.1 summarizes
the results described in the above paragraph.
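The contrast between the two descriptions can be reproduced numerically. The sketch below evaluates the probability of the $|+\rangle$ outcome for a superposition and for a probabilistic mixture; the amplitudes 0.6 and 0.8 and the helper names `inner` and `prob` are illustrative assumptions.

```python
import math

s = 1 / math.sqrt(2)
zero, one = [1, 0], [0, 1]
plus, minus = [s, s], [s, -s]

def inner(bra, ket):
    """Amplitude <bra|ket>."""
    return sum(complex(b).conjugate() * k for b, k in zip(bra, ket))

def prob(outcome, state):
    """Born-rule probability |<outcome|state>|^2."""
    return abs(inner(outcome, state)) ** 2

alpha, beta = 0.6, 0.8            # example amplitudes, |alpha|^2 + |beta|^2 = 1
psi = [alpha, beta]               # superposition alpha|0> + beta|1>

# Superposition: the amplitudes interfere before the magnitude is squared.
p_plus_superposition = prob(plus, psi)
p_minus_superposition = prob(minus, psi)

# Mixture: the state is |0> with probability |alpha|^2, else |1>; no interference.
p_plus_mixture = (abs(alpha) ** 2 * prob(plus, zero)
                  + abs(beta) ** 2 * prob(plus, one))
```

With these amplitudes the superposition gives the $|+\rangle$ outcome with probability $|\alpha + \beta|^2/2 = 0.98$, whereas the mixture gives $1/2$, matching the two rows of Table 3.1.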

Now we consider a "Stern-Gerlach"-like argument to illustrate another example of
fundamental quantum behavior [101]. The Stern-Gerlach experiment was a crucial one for
determining the "strange" behavior of quantum spin states. Suppose we prepare the state $|0\rangle$. If
we measure this state in the $Z$ basis, the result is that we always obtain the state $|0\rangle$ because
it is a definite $Z$ eigenstate. Suppose now that we measure the $X$ operator. The state $|0\rangle$
is equivalent to a uniform superposition of $|+\rangle$ and $|-\rangle$. The measurement postulate then
states that we get the state $|+\rangle$ or $|-\rangle$ with equal probability after performing this
measurement. If we then measure the $Z$ operator again, the result is completely random. The $Z$
measurement result is $|0\rangle$ or $|1\rangle$ with equal probability if the result of the $X$ measurement is
$|+\rangle$ and the same distribution holds if the result of the $X$ measurement is $|-\rangle$. This argument
demonstrates that the measurement of the $X$ operator throws off the measurement of the $Z$
operator. The Stern-Gerlach experiment was one of the earliest to validate the predictions
of the quantum theory.
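The argument above can be followed with a few Born-rule evaluations. This sketch compares the probability of obtaining $|0\rangle$ in a $Z$ measurement made directly on $|0\rangle$ with the probability when an $X$ measurement intervenes; the helper names are assumptions made for the example, and the calculation is deterministic rather than a random simulation.

```python
import math

s = 1 / math.sqrt(2)
zero, one = [1, 0], [0, 1]
plus, minus = [s, s], [s, -s]

def inner(bra, ket):
    """Amplitude <bra|ket> for real-valued vectors."""
    return sum(b * k for b, k in zip(bra, ket))

def prob(outcome, state):
    """Born-rule probability |<outcome|state>|^2."""
    return abs(inner(outcome, state)) ** 2

# Measure Z directly on the prepared state |0>.
p_zero_direct = prob(zero, zero)

# Measure X first (the state collapses to |+> or |->), then measure Z.
p_zero_after_x = sum(prob(x_state, zero) * prob(zero, x_state)
                     for x_state in (plus, minus))
```

The direct measurement returns $|0\rangle$ with certainty, while the intervening $X$ measurement reduces that probability to one half.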
3.4.1 Probability, Expectation, and Variance of an Operator

We have an alternate, more formal way of stating the measurement postulate that turns
out to be more useful for a general quantum system. Suppose that we are measuring
the $Z$ operator. The diagonal representation of this operator is

$$Z = |0\rangle\langle 0| - |1\rangle\langle 1|. \tag{3.93}$$

Consider the operator

$$\Pi_0 \equiv |0\rangle\langle 0|. \tag{3.94}$$

It is a projection operator because applying it twice has the same effect as applying it
once: $\Pi_0^2 = \Pi_0$. It projects onto the subspace spanned by the single vector $|0\rangle$. A similar
line of analysis applies to the projection operator

$$\Pi_1 \equiv |1\rangle\langle 1|. \tag{3.95}$$

So we can represent the $Z$ operator as $\Pi_0 - \Pi_1$. Performing a measurement of the $Z$ operator
is equivalent to asking the question: Is the state $|0\rangle$ or $|1\rangle$? Consider the quantity $\langle\psi|\Pi_0|\psi\rangle$:

$$\langle\psi|\Pi_0|\psi\rangle = \langle\psi|0\rangle\langle 0|\psi\rangle = \alpha^*\alpha = |\alpha|^2. \tag{3.96}$$

A similar analysis demonstrates that

$$\langle\psi|\Pi_1|\psi\rangle = |\beta|^2. \tag{3.97}$$

These two quantities then give the probability that the state collapses to $|0\rangle$ or $|1\rangle$.

A more general way of expressing a measurement in the $Z$ basis is to say that we have
a set $\{\Pi_i\}_{i \in \{0,1\}}$ of measurement operators that determine the outcome probabilities. These
measurement operators also determine the state that results after the measurement. If the
measurement result is $+1$, then the resulting state is

$$\frac{\Pi_0|\psi\rangle}{\sqrt{\langle\psi|\Pi_0|\psi\rangle}} = |0\rangle, \tag{3.98}$$

where we implicitly ignore the irrelevant global phase factor $\alpha/|\alpha|$. If the measurement result
is $-1$, then the resulting state is

$$\frac{\Pi_1|\psi\rangle}{\sqrt{\langle\psi|\Pi_1|\psi\rangle}} = |1\rangle, \tag{3.99}$$

where we again implicitly ignore the irrelevant global phase factor $\beta/|\beta|$. Dividing by $\sqrt{\langle\psi|\Pi_i|\psi\rangle}$
for $i = 0, 1$ ensures that the state resulting after measurement corresponds to a physical state
that has unit norm.

We can also measure any orthonormal basis in this way. This type of projective
measurement is called a von Neumann measurement. For any orthonormal basis $\{|\phi_i\rangle\}_{i \in \{0,1\}}$,
the measurement operators are $\{|\phi_i\rangle\langle\phi_i|\}_{i \in \{0,1\}}$, and the state collapses to $|\phi_i\rangle\langle\phi_i|\psi\rangle / |\langle\phi_i|\psi\rangle|$
with probability $\langle\psi|\phi_i\rangle\langle\phi_i|\psi\rangle = |\langle\phi_i|\psi\rangle|^2$.

Exercise 3.4.1 Determine the set of measurement operators corresponding to a
measurement of the $X$ observable.
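The von Neumann measurement rule described before the exercise can be sketched as a function that returns, for each basis state, the outcome probability together with the normalized post-measurement state. The function name `measure` and the example state are assumptions for illustration; the demonstration uses the computational basis so that the exercise remains open.

```python
def inner(bra, ket):
    """Amplitude <bra|ket>."""
    return sum(complex(b).conjugate() * k for b, k in zip(bra, ket))

def measure(state, basis):
    """List of (probability, post-measurement state) pairs for a projective
    measurement in the given orthonormal basis."""
    results = []
    for phi in basis:
        amplitude = inner(phi, state)              # <phi_i|psi>
        probability = abs(amplitude) ** 2          # |<phi_i|psi>|^2
        if probability == 0:
            post_state = None                      # this outcome never occurs
        else:
            # |phi_i><phi_i|psi> divided by |<phi_i|psi>|
            post_state = [amplitude * c / abs(amplitude) for c in phi]
        results.append((probability, post_state))
    return results

zero, one = [1, 0], [0, 1]
psi = [0.6, 0.8j]                                  # example state
outcomes = measure(psi, [zero, one])
```

For this example the second post-measurement state is $i|1\rangle$ rather than $|1\rangle$: the leftover factor is exactly the global phase $\beta/|\beta|$ that the text says we may ignore.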

We might want to determine the expected measurement result when measuring the $Z$
operator. The probability of getting the $+1$ value corresponding to the $|0\rangle$ state is $|\alpha|^2$ and
the probability of getting the $-1$ value corresponding to the $-1$ eigenstate $|1\rangle$ is $|\beta|^2$. Standard
probability theory then gives us a way to calculate the expected value of a measurement of
the $Z$ operator when the state is $|\psi\rangle$:

\begin{align}
\mathbb{E}[Z] &= |\alpha|^2 (1) + |\beta|^2 (-1) \tag{3.100}\\
&= |\alpha|^2 - |\beta|^2. \tag{3.101}
\end{align}

We can formulate an alternate way to write this expectation, by making use of the Dirac
notation:

\begin{align}
\mathbb{E}[Z] &= |\alpha|^2 (1) + |\beta|^2 (-1) \tag{3.102}\\
&= \langle\psi|\Pi_0|\psi\rangle + \langle\psi|\Pi_1|\psi\rangle (-1) \tag{3.103}\\
&= \langle\psi|\Pi_0 - \Pi_1|\psi\rangle \tag{3.104}\\
&= \langle\psi|Z|\psi\rangle. \tag{3.105}
\end{align}

It is common for physicists to denote the expectation as

$$\langle Z \rangle \equiv \langle\psi|Z|\psi\rangle, \tag{3.106}$$

when it is understood that the expectation is with respect to the state $|\psi\rangle$. This type of
expression is a general one and the next exercise asks you to show that it works for the $X$
and $Y$ operators as well.

Exercise 3.4.2 Show that the expressions $\langle\psi|X|\psi\rangle$ and $\langle\psi|Y|\psi\rangle$ give the respective
expectations $\mathbb{E}[X]$ and $\mathbb{E}[Y]$ when measuring the state $|\psi\rangle$ in the respective $X$ and $Y$ basis.

We also might want to determine the variance of the measurement of the $Z$ operator.
Standard probability theory again gives that

$$\mathrm{Var}[Z] = \mathbb{E}[Z^2] - \mathbb{E}[Z]^2. \tag{3.107}$$

Physicists denote the standard deviation of the measurement of the $Z$ operator as

$$\Delta Z \equiv \left\langle (Z - \langle Z \rangle)^2 \right\rangle^{1/2}, \tag{3.108}$$

and thus the variance is equal to $(\Delta Z)^2$. Physicists often refer to $\Delta Z$ as the uncertainty of
the observable $Z$ when the state is $|\psi\rangle$.

In order to calculate the variance $\mathrm{Var}[Z]$, we really just need the second moment $\mathbb{E}[Z^2]$
because we already have the expectation $\mathbb{E}[Z]$:

\begin{align}
\mathbb{E}[Z^2] &= |\alpha|^2 (1)^2 + |\beta|^2 (-1)^2 \tag{3.109}\\
&= |\alpha|^2 + |\beta|^2. \tag{3.110}
\end{align}

We can again calculate this quantity with the Dirac notation. The quantity $\langle\psi|Z^2|\psi\rangle$ is the
same as $\mathbb{E}[Z^2]$ and the next exercise asks you for a proof.
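As a numerical companion to these formulas, the following sketch evaluates the mean, second moment, and variance of a $Z$ measurement through sandwiches of the form $\langle\psi|A|\psi\rangle$. The example state and the function names `expectation` and `matmul` are assumptions made for illustration.

```python
def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def expectation(op, state):
    """The quantity <psi|op|psi> for a 2x2 operator and a two-component ket."""
    return sum(complex(state[i]).conjugate() * op[i][j] * state[j]
               for i in range(2) for j in range(2))

Z = [[1, 0], [0, -1]]
alpha, beta = 0.6, 0.8j                     # example amplitudes
psi = [alpha, beta]

mean_Z = expectation(Z, psi).real                   # |alpha|^2 - |beta|^2
second_moment = expectation(matmul(Z, Z), psi).real # |alpha|^2 + |beta|^2
variance_Z = second_moment - mean_Z ** 2            # Var[Z] as in (3.107)
```

For this state the mean is $0.36 - 0.64 = -0.28$, the second moment is one, and the variance is therefore $1 - 0.28^2 = 0.9216$.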
Exercise 3.4.3 Show that $\mathbb{E}[X^2] = \langle\psi|X^2|\psi\rangle$, $\mathbb{E}[Y^2] = \langle\psi|Y^2|\psi\rangle$, and $\mathbb{E}[Z^2] = \langle\psi|Z^2|\psi\rangle$.

3.4.2 The Uncertainty Principle 

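The uncertainty principle is a fundamental aspect of the quantum theory. In the case of
qubits, one instance of the uncertainty principle gives a lower bound on the product of the
uncertainty of the $Z$ operator with the uncertainty of the $X$ operator:

$$\Delta Z \, \Delta X \geq \frac{1}{2} \left| \langle\psi|[X, Z]|\psi\rangle \right|. \tag{3.111}$$

We can prove this principle using the postulates of the quantum theory. Let us define the
operators $Z_0 \equiv Z - \langle Z \rangle$ and $X_0 \equiv X - \langle X \rangle$. First, consider that

\begin{align}
\Delta Z \, \Delta X &= \langle\psi|Z_0^2|\psi\rangle^{1/2} \langle\psi|X_0^2|\psi\rangle^{1/2} \tag{3.112}\\
&\geq \left| \langle\psi|Z_0 X_0|\psi\rangle \right|. \tag{3.113}
\end{align}

The above step follows by applying the Cauchy-Schwarz inequality to the vectors $X_0|\psi\rangle$ and
$Z_0|\psi\rangle$. For any operator $A$, we define its real part $\mathrm{Re}\{A\}$ as

$$\mathrm{Re}\{A\} \equiv \frac{A + A^\dagger}{2}, \tag{3.114}$$

and its imaginary part $\mathrm{Im}\{A\}$ as

$$\mathrm{Im}\{A\} \equiv \frac{A - A^\dagger}{2i}, \tag{3.115}$$

so that

$$A = \mathrm{Re}\{A\} + i \, \mathrm{Im}\{A\}. \tag{3.116}$$

So the real and imaginary parts of the operator $Z_0 X_0$ are

\begin{align}
\mathrm{Re}\{Z_0 X_0\} &= \frac{Z_0 X_0 + X_0 Z_0}{2} \equiv \frac{\{Z_0, X_0\}}{2}, \tag{3.117}\\
\mathrm{Im}\{Z_0 X_0\} &= \frac{Z_0 X_0 - X_0 Z_0}{2i} \equiv \frac{[Z_0, X_0]}{2i}, \tag{3.118}
\end{align}

where $\{Z_0, X_0\}$ is the anticommutator of $Z_0$ and $X_0$ and $[Z_0, X_0]$ is the commutator of the
two operators. We can then express the quantity $|\langle\psi|Z_0 X_0|\psi\rangle|$ in terms of the real and
imaginary parts of $Z_0 X_0$:

\begin{align}
\left| \langle\psi|Z_0 X_0|\psi\rangle \right| &= \left| \langle\psi|\mathrm{Re}\{Z_0 X_0\}|\psi\rangle + i \langle\psi|\mathrm{Im}\{Z_0 X_0\}|\psi\rangle \right| \tag{3.119}\\
&\geq \left| \langle\psi|\mathrm{Im}\{Z_0 X_0\}|\psi\rangle \right| \tag{3.120}\\
&= \left| \langle\psi|[Z_0, X_0]|\psi\rangle \right| / 2 \tag{3.121}\\
&= \left| \langle\psi|[Z, X]|\psi\rangle \right| / 2. \tag{3.122}
\end{align}

The first equality follows by substitution, the first inequality follows because the magnitude
of any complex number is at least as large as the magnitude of its imaginary part (the two
expectations in (3.119) are real because $\mathrm{Re}\{Z_0 X_0\}$ and $\mathrm{Im}\{Z_0 X_0\}$ are Hermitian), the second
equality follows by substitution with (3.118), and the third equality follows by the result of
Exercise 3.4.4 below.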

The commutator of the operators Z and X arises in the lower bound, and thus, the 
non-commutativity of the operators Z and X is the fundamental reason that there is an 
uncertainty principle for them. Also, there is no uncertainty principle for any two operators 
that commute with each other. 
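The bound in (3.111) can be spot-checked numerically over a family of qubit states. The sketch below parametrizes states by two angles, computes both sides of the inequality, and is only a sanity check under these assumptions rather than a proof; the function names and the parametrization are chosen for illustration.

```python
import cmath
import math

X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]

def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def expectation(op, psi):
    """The quantity <psi|op|psi>."""
    return sum(complex(psi[i]).conjugate() * op[i][j] * psi[j]
               for i in range(2) for j in range(2))

def uncertainty(op, psi):
    """Standard deviation of the observable op in the state psi."""
    mean = expectation(op, psi).real
    second = expectation(matmul(op, op), psi).real
    return math.sqrt(max(second - mean ** 2, 0.0))   # clip tiny negative rounding

def lower_bound(psi):
    """Right-hand side of (3.111): |<psi|[X, Z]|psi>| / 2."""
    xz, zx = matmul(X, Z), matmul(Z, X)
    comm = [[xz[i][j] - zx[i][j] for j in range(2)] for i in range(2)]
    return abs(expectation(comm, psi)) / 2

def both_sides(theta, phi):
    """Uncertainty product and lower bound for an example state."""
    psi = [math.cos(theta / 2), cmath.exp(1j * phi) * math.sin(theta / 2)]
    return uncertainty(Z, psi) * uncertainty(X, psi), lower_bound(psi)

product, bound = both_sides(math.pi / 2, math.pi / 2)
```

For the angles in the last line both sides come out equal to one up to rounding, so the bound is saturated for that state, while for other angles the product is strictly larger than the bound.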

Exercise 3.4.4 Show that $[Z_0, X_0] = [Z, X]$ and that $[Z, X] = 2iY$.

Exercise 3.4.5 The uncertainty principle in (3.111) has the property that the lower bound
has a dependence on the state $|\psi\rangle$. Find a state $|\psi\rangle$ for which the lower bound on the
uncertainty product $\Delta X \Delta Z$ vanishes.⁴

3.5 Composite Quantum Systems 

A single physical qubit is an interesting physical system that exhibits uniquely quantum
phenomena, but it is not particularly useful on its own (just as a single classical bit is not
particularly useful for classical communication or computation). We can only perform
interesting quantum information processing tasks when we combine qubits together. Therefore,
we should have a way for describing their behavior when they combine to form a composite
quantum system.

Consider two classical bits $c_0$ and $c_1$. In order to describe bit operations on the pair of
cbits, we write them as an ordered pair $(c_1, c_0)$. The space of all possible bit values is the
Cartesian product $\mathbb{Z}_2 \times \mathbb{Z}_2$ of two copies of the set $\mathbb{Z}_2 \equiv \{0, 1\}$:

$$\mathbb{Z}_2 \times \mathbb{Z}_2 \equiv \{(0, 0), (0, 1), (1, 0), (1, 1)\}. \tag{3.123}$$

Typically, we make the abbreviation $c_1 c_0 \equiv (c_1, c_0)$ when representing cbit states.

We can represent the state of two cbits with particular states of qubits. For example, we
can represent the two-cbit state 00 with the following mapping:

$$00 \to |0\rangle|0\rangle. \tag{3.124}$$

Many times, we make the abbreviation $|00\rangle \equiv |0\rangle|0\rangle$ when representing two-cbit states with
qubits. In general, any two-cbit state $c_1 c_0$ has the following representation as a two-qubit
state:

$$c_1 c_0 \to |c_1 c_0\rangle. \tag{3.125}$$



⁴ Do not be alarmed by the result of this exercise! The usual formulation of the uncertainty principle only
gives a lower bound on the uncertainty product. This lower bound never vanishes for the case of position
and momentum because the commutator of these two observables is equal to the identity operator multiplied
by $i$, but it can vanish for the operators given in the exercise.

The above qubit states are not the only possible states that can occur in the quantum
theory. By the superposition principle, any possible linear combination of the set of two-cbit
states is a possible two-qubit state:

$$|\xi\rangle \equiv \alpha|00\rangle + \beta|01\rangle + \gamma|10\rangle + \delta|11\rangle. \tag{3.126}$$

The unit-norm condition $|\alpha|^2 + |\beta|^2 + |\gamma|^2 + |\delta|^2 = 1$ again must hold for the two-qubit
state to correspond to a physical quantum state. It is now clear that the Cartesian product
is not sufficient for representing two-qubit quantum states because it does not allow for
linear combinations of states (just as the mathematics of Boolean algebra is not sufficient to
represent single-qubit states).

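We again turn to linear algebra to determine a representation that suffices. The tensor
product is the mathematical operation that gives a sufficient representation of two-qubit
quantum states. Suppose we have two two-dimensional vectors:

$$\begin{bmatrix} a_1 \\ b_1 \end{bmatrix}, \qquad \begin{bmatrix} a_2 \\ b_2 \end{bmatrix}. \tag{3.127}$$

The tensor product of these two vectors is

$$\begin{bmatrix} a_1 \\ b_1 \end{bmatrix} \otimes \begin{bmatrix} a_2 \\ b_2 \end{bmatrix} \equiv \begin{bmatrix} a_1 \begin{bmatrix} a_2 \\ b_2 \end{bmatrix} \\ b_1 \begin{bmatrix} a_2 \\ b_2 \end{bmatrix} \end{bmatrix} = \begin{bmatrix} a_1 a_2 \\ a_1 b_2 \\ b_1 a_2 \\ b_1 b_2 \end{bmatrix}. \tag{3.128}$$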



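Recall, from (3.4)-(3.5), the vector representation of the single-qubit states $|0\rangle$ and $|1\rangle$.
Using these vector representations and the above definition of the tensor product, the
two-qubit basis states have the following vector representations:

$$|00\rangle = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad |01\rangle = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad |10\rangle = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad |11\rangle = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}. \tag{3.129}$$

A simple way to remember these representations is that the bits inside the ket index the
element equal to one in the vector. For example, the vector representation of $|01\rangle$ has a
one as its second element because 01 is the second index for the two-bit strings. The vector
representation of the superposition state in (3.126) is

$$\begin{bmatrix} \alpha \\ \beta \\ \gamma \\ \delta \end{bmatrix}. \tag{3.130}$$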
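The tensor product of vectors in (3.128) and the indexing rule for basis kets can be sketched in a few lines. The function names `kron_vec` and `ket` are hypothetical helpers for this example, and bit strings are passed as Python strings such as "01".

```python
def kron_vec(u, v):
    """Tensor product of two vectors: every entry of u multiplies all of v."""
    return [a * b for a in u for b in v]

zero, one = [1, 0], [0, 1]

def ket(bits):
    """Vector for a computational basis ket such as |01>, built qubit by qubit."""
    vec = [1]
    for bit in bits:
        vec = kron_vec(vec, one if bit == "1" else zero)
    return vec

ket_01 = ket("01")
position_of_one = ket_01.index(1)    # zero-based position of the nonzero entry
binary_value = int("01", 2)          # the bit string read as a binary number
```

The zero-based position of the single nonzero entry equals the bit string read as a binary number, which is the same rule as in the text once one counts from zero instead of from one.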



There are actually many different ways that we can write two-qubit states, and we go
through all of these right now. Physicists have developed many shorthands, and it is
important to know each of these because they often appear in the literature (we even use different
notations depending on the context). We may use any of the following two-qubit notations
if the two qubits are local to one party and only one party is involved in a protocol:

\begin{align}
&\alpha|0\rangle \otimes |0\rangle + \beta|0\rangle \otimes |1\rangle + \gamma|1\rangle \otimes |0\rangle + \delta|1\rangle \otimes |1\rangle, \tag{3.131}\\
&\alpha|0\rangle|0\rangle + \beta|0\rangle|1\rangle + \gamma|1\rangle|0\rangle + \delta|1\rangle|1\rangle, \tag{3.132}\\
&\alpha|00\rangle + \beta|01\rangle + \gamma|10\rangle + \delta|11\rangle. \tag{3.133}
\end{align}

We can put labels on the qubits if two or more parties are involved in the protocol:

\begin{align}
&\alpha|0\rangle^A \otimes |0\rangle^B + \beta|0\rangle^A \otimes |1\rangle^B + \gamma|1\rangle^A \otimes |0\rangle^B + \delta|1\rangle^A \otimes |1\rangle^B, \tag{3.134}\\
&\alpha|0\rangle^A|0\rangle^B + \beta|0\rangle^A|1\rangle^B + \gamma|1\rangle^A|0\rangle^B + \delta|1\rangle^A|1\rangle^B, \tag{3.135}\\
&\alpha|00\rangle^{AB} + \beta|01\rangle^{AB} + \gamma|10\rangle^{AB} + \delta|11\rangle^{AB}. \tag{3.136}
\end{align}

This second scenario is different from the first scenario because two spatially separated
parties share the two-qubit state. If the state has quantum correlations, then it can be
valuable as a communication resource. We go into more detail on this topic in Section 3.5.6
on entanglement.



3.5.1 Evolution of Composite Systems 

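The postulate on unitary evolution extends to the two-qubit scenario as well. First, let us
establish that the tensor product $A \otimes B$ of two operators $A$ and $B$ is

\begin{align}
A \otimes B &\equiv \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \otimes \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} \tag{3.137}\\
&\equiv \begin{bmatrix} a_{11} \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} & a_{12} \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} \\ a_{21} \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} & a_{22} \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} \end{bmatrix} \tag{3.138}\\
&= \begin{bmatrix} a_{11} b_{11} & a_{11} b_{12} & a_{12} b_{11} & a_{12} b_{12} \\ a_{11} b_{21} & a_{11} b_{22} & a_{12} b_{21} & a_{12} b_{22} \\ a_{21} b_{11} & a_{21} b_{12} & a_{22} b_{11} & a_{22} b_{12} \\ a_{21} b_{21} & a_{21} b_{22} & a_{22} b_{21} & a_{22} b_{22} \end{bmatrix}. \tag{3.139}
\end{align}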



Consider the two-qubit state in (3.126). We can perform a NOT gate on the first qubit
so that it changes to

$$\alpha|10\rangle + \beta|11\rangle + \gamma|00\rangle + \delta|01\rangle. \tag{3.140}$$

We can likewise flip its second qubit:

$$\alpha|01\rangle + \beta|00\rangle + \gamma|11\rangle + \delta|10\rangle, \tag{3.141}$$

or flip both at the same time:

$$\alpha|11\rangle + \beta|10\rangle + \gamma|01\rangle + \delta|00\rangle. \tag{3.142}$$

Figure 3.6: The above figure depicts circuits for the example two-qubit unitaries $X_1 I_2$, $I_1 X_2$, and $X_1 X_2$.

Figure 3.6 depicts quantum circuit representations of these operations. These are all
reversible operations because applying them again gives the original state in (3.126). In the
first case, we did nothing to the second qubit, and in the second case, we did nothing to the
first qubit. The identity operator acts on the qubits that have nothing happen to them.

Let us label the first qubit as "1" and the second qubit as "2." We can then label the
operator for the first operation as $X_1 I_2$ because this operator flips the first qubit and does
nothing (applies the identity) to the second qubit. We can also label the operators for the
second and third operations respectively as $I_1 X_2$ and $X_1 X_2$. The matrix representation of
the operator $X_1 I_2$ is the tensor product of the matrix representation of $X$ with the matrix
representation of $I$; this relation similarly holds for the operators $I_1 X_2$ and $X_1 X_2$. We show
that it holds for the operator $X_1 I_2$ and ask you to verify the other two cases. We can use
the two-qubit computational basis to get a matrix representation for the two-qubit operator
$X_1 I_2$:

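$$\begin{bmatrix}
\langle 00|X_1 I_2|00\rangle & \langle 00|X_1 I_2|01\rangle & \langle 00|X_1 I_2|10\rangle & \langle 00|X_1 I_2|11\rangle \\
\langle 01|X_1 I_2|00\rangle & \langle 01|X_1 I_2|01\rangle & \langle 01|X_1 I_2|10\rangle & \langle 01|X_1 I_2|11\rangle \\
\langle 10|X_1 I_2|00\rangle & \langle 10|X_1 I_2|01\rangle & \langle 10|X_1 I_2|10\rangle & \langle 10|X_1 I_2|11\rangle \\
\langle 11|X_1 I_2|00\rangle & \langle 11|X_1 I_2|01\rangle & \langle 11|X_1 I_2|10\rangle & \langle 11|X_1 I_2|11\rangle
\end{bmatrix}
=
\begin{bmatrix}
\langle 00|10\rangle & \langle 00|11\rangle & \langle 00|00\rangle & \langle 00|01\rangle \\
\langle 01|10\rangle & \langle 01|11\rangle & \langle 01|00\rangle & \langle 01|01\rangle \\
\langle 10|10\rangle & \langle 10|11\rangle & \langle 10|00\rangle & \langle 10|01\rangle \\
\langle 11|10\rangle & \langle 11|11\rangle & \langle 11|00\rangle & \langle 11|01\rangle
\end{bmatrix}
=
\begin{bmatrix}
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0
\end{bmatrix}. \tag{3.143}$$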



Inspecting the definition of the tensor product for matrices in (3.137) shows that this last
matrix is equal to the tensor product $X \otimes I$.
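The comparison above can also be checked numerically. The following is a minimal sketch in Python with NumPy (a tool chosen here for illustration, not one the text prescribes); the names `X`, `I2`, `X1I2`, and `from_amplitudes` are hypothetical. It builds $X \otimes I$ with `np.kron` and compares it with the matrix obtained entry by entry from the basis amplitudes, as in (3.143).

```python
import numpy as np

# Single-qubit NOT (Pauli X) and identity in the computational basis.
X = np.array([[0, 1], [1, 0]])
I2 = np.eye(2, dtype=int)

# np.kron implements the matrix tensor product of (3.137).
X1I2 = np.kron(X, I2)

# Rebuild the same matrix entry by entry, as in (3.143): X_1 I_2 maps the
# basis state |k l> to |k XOR 1, l>, so the entry <ij|X_1 I_2|kl> equals
# 1 when (i, j) == (k ^ 1, l) and 0 otherwise.
from_amplitudes = np.zeros((4, 4), dtype=int)
for k in range(2):
    for l in range(2):
        col = 2 * k + l          # index of |kl>
        row = 2 * (k ^ 1) + l    # index of the flipped state
        from_amplitudes[row, col] = 1

same = np.array_equal(X1I2, from_amplitudes)
print(X1I2)
print("matches (3.143):", same)
```

Swapping the arguments of `np.kron` gives the matrix for $I \otimes X$, which is one way to sanity-check a hand calculation for the other two operators.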



Exercise 3.5.1 Show that the matrix representation of the operator $I_1 X_2$ is equal to the
tensor product $I \otimes X$. Show the same for $X_1 X_2$ and $X \otimes X$.

3.5.2 Probability Amplitudes for Composite Systems 

We relied on the orthogonality of the two-qubit computational basis states for evaluating
amplitudes such as $\langle 00|10\rangle$ or $\langle 00|00\rangle$ in the above matrix representation. It turns out that
there is another way to evaluate these amplitudes that relies only on the orthogonality of the
single-qubit computational basis states. Suppose that we have four single-qubit states $|\phi_0\rangle$,
$|\phi_1\rangle$, $|\psi_0\rangle$, $|\psi_1\rangle$, and we make the following two-qubit states from them:

$$|\phi_0\rangle \otimes |\psi_0\rangle, \tag{3.144}$$

$$|\phi_1\rangle \otimes |\psi_1\rangle. \tag{3.145}$$

We may represent these states equally well as follows:

$$|\phi_0, \psi_0\rangle, \tag{3.146}$$

$$|\phi_1, \psi_1\rangle, \tag{3.147}$$

because the Dirac notation is versatile (virtually anything can go inside a ket as long as its
meaning is not ambiguous). The bra $\langle\phi_1, \psi_1|$ is dual to the ket $|\phi_1, \psi_1\rangle$, and we can use it to
calculate the following amplitude:

$$\langle\phi_1, \psi_1|\phi_0, \psi_0\rangle. \tag{3.148}$$

This amplitude is equivalent to the multiplication of the single-qubit amplitudes:

$$\langle\phi_1, \psi_1|\phi_0, \psi_0\rangle = \langle\phi_1|\phi_0\rangle\langle\psi_1|\psi_0\rangle. \tag{3.149}$$

Exercise 3.5.2 Verify that the amplitudes $\{\langle ij|kl\rangle\}_{i,j,k,l\in\{0,1\}}$ are respectively equal to the
amplitudes $\{\langle i|k\rangle\langle j|l\rangle\}_{i,j,k,l\in\{0,1\}}$. By linearity, this exercise justifies the relation in (3.149)
(at least for two-qubit states).
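As a numerical illustration of (3.149), the sketch below draws four random single-qubit states and compares the two-qubit amplitude with the product of the single-qubit amplitudes. It assumes Python with NumPy; the helper `random_qubit` and the variable names are hypothetical and serve only this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_qubit(rng):
    """Return a random normalized single-qubit state vector."""
    v = rng.normal(size=2) + 1j * rng.normal(size=2)
    return v / np.linalg.norm(v)

phi0, phi1, psi0, psi1 = (random_qubit(rng) for _ in range(4))

# Two-qubit amplitude <phi1, psi1 | phi0, psi0>; np.vdot conjugates its
# first argument, which plays the role of the bra.
lhs = np.vdot(np.kron(phi1, psi1), np.kron(phi0, psi0))

# Product of the single-qubit amplitudes <phi1|phi0><psi1|psi0>.
rhs = np.vdot(phi1, phi0) * np.vdot(psi1, psi0)

print(np.isclose(lhs, rhs))
```

A check on random states is evidence rather than a proof; the exercise above asks for the argument on basis states, from which linearity gives the general case.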

3.5.3 Controlled Gates 

An important two-qubit unitary evolution is the controlled-NOT (CNOT) gate. We consider 
its classical version first. The classical gate acts on two cbits. It does nothing if the first bit 
is equal to zero, and flips the second bit if the first bit is equal to one: 

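$$00 \to 00, \quad 01 \to 01, \quad 10 \to 11, \quad 11 \to 10. \tag{3.150}$$

We turn this gate into a quantum gate$^5$ by demanding that it act in the same way on the
two-qubit computational basis states:

$$|00\rangle \to |00\rangle, \quad |01\rangle \to |01\rangle, \quad |10\rangle \to |11\rangle, \quad |11\rangle \to |10\rangle. \tag{3.151}$$

This behavior carries over to superposition states as well:

$$\alpha|00\rangle + \beta|01\rangle + \gamma|10\rangle + \delta|11\rangle \ \xrightarrow{\text{CNOT}}\ \alpha|00\rangle + \beta|01\rangle + \gamma|11\rangle + \delta|10\rangle. \tag{3.152}$$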



5 There are other terms for the action of turning a classical operation into a quantum one. Some examples
are "making it coherent," "coherifying," or the quantum gate is a "coherification" of the classical one. The
term "coherify" is not a proper English word, but we will use it regardless at certain points.

Figure 3.7: The above figure depicts the circuit diagrams that we use for (a) a CNOT gate and (b) a
controlled-$U$ gate.

A useful operator representation of the CNOT gate is

$$\text{CNOT} \equiv |0\rangle\langle 0| \otimes I + |1\rangle\langle 1| \otimes X. \tag{3.153}$$



The above representation truly captures the coherent quantum nature of the CNOT gate. In 
the classical CNOT gate, we can say that it is a conditional gate, in the sense that the gate 
applies to the second bit conditional on the value of the first bit. In the quantum CNOT gate, 
the second operation is controlled on the basis state of the first qubit (hence the choice of the 
name "controlled-NOT"). That is, the gate always applies the second operation regardless 
of the actual qubit state on which it acts. 



A controlled-$U$ gate is similar to the CNOT gate in (3.153). It simply applies the
unitary $U$ (assumed to be a single-qubit unitary) to the second qubit, controlled on the
first qubit:

$$\text{controlled-}U \equiv |0\rangle\langle 0| \otimes I + |1\rangle\langle 1| \otimes U. \tag{3.154}$$

The control qubit can be controlled with respect to any orthonormal basis $\{|\phi_0\rangle, |\phi_1\rangle\}$:

$$|\phi_0\rangle\langle\phi_0| \otimes I + |\phi_1\rangle\langle\phi_1| \otimes U. \tag{3.155}$$

Figure 3.7 depicts the circuit diagrams for a controlled-NOT and controlled-$U$ operation.
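The operator representations in (3.153) and (3.154) translate directly into matrices. The sketch below assumes Python with NumPy; the function name `controlled` and the example amplitudes are hypothetical choices made for illustration.

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
P0 = np.array([[1.0, 0.0], [0.0, 0.0]])   # |0><0|
P1 = np.array([[0.0, 0.0], [0.0, 1.0]])   # |1><1|

def controlled(U):
    """Controlled-U with the first qubit as control, as in (3.154)."""
    return np.kron(P0, I2) + np.kron(P1, U)

CNOT = controlled(X)   # the special case (3.153)

# Apply CNOT to a superposition whose amplitudes are ordered as
# |00>, |01>, |10>, |11>; the last two amplitudes swap, as in (3.152).
state = np.array([0.1, 0.2, 0.3, 0.4])
state = state / np.linalg.norm(state)
out = CNOT @ state
print(out)
```

Building the gate as a sum of projector terms keeps the code close to the operator form, so replacing `P0` and `P1` by projectors onto another orthonormal basis gives the variant in (3.155).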



Exercise 3.5.3 Verify that the matrix representation of the CNOT gate in the computational
basis is

$$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}. \tag{3.156}$$



Exercise 3.5.4 Consider applying Hadamards to the first and second qubits before and
after a CNOT acts on them. Show that this gate is equivalent to a CNOT in the $+/-$ basis
(recall that the $Z$ operator flips the $+/-$ basis):

$$H_1 H_2 \,\text{CNOT}\, H_1 H_2 = |+\rangle\langle +| \otimes I + |-\rangle\langle -| \otimes Z. \tag{3.157}$$
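Before attempting the algebra, the identity (3.157) can be sanity-checked numerically. This is a minimal sketch assuming Python with NumPy; all variable names are hypothetical, and a numerical check is of course not a substitute for the derivation the exercise asks for.

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)

# CNOT with the first qubit as control, from (3.153).
CNOT = np.kron(np.diag([1.0, 0.0]), I2) + np.kron(np.diag([0.0, 1.0]), X)

# Left-hand side of (3.157): Hadamards on both qubits before and after.
HH = np.kron(H, H)
lhs = HH @ CNOT @ HH

# Right-hand side: |+><+| (x) I + |-><-| (x) Z.
plus = np.array([1.0, 1.0]) / np.sqrt(2)
minus = np.array([1.0, -1.0]) / np.sqrt(2)
rhs = np.kron(np.outer(plus, plus), I2) + np.kron(np.outer(minus, minus), Z)

print(np.allclose(lhs, rhs))
```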

Example 3.5.1 Show that two CNOT gates with the same control qubit commute.

Exercise 3.5.5 Show that two CNOT gates with the same target qubit commute.

3.5.4 The No Cloning Theorem

The no cloning theorem is one of the simplest results in the quantum theory, yet it has some 
of the most profound consequences. It states that it is impossible to build a universal copier 
of quantum states. A universal copier would be a device that could copy any arbitrary 
quantum state that is input to it. It may be surprising at first to hear that copying quantum 
information is impossible because copying classical information is ubiquitous. 

We give a simple proof for the no-cloning theorem. Suppose for a contradiction that there
is a two-qubit unitary operator $U$ acting as a universal copier of quantum information. That
is, if we input an arbitrary state $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$ as the first qubit and input an ancilla
qubit $|0\rangle$ as the second qubit, it "writes" the first qubit to the second qubit slot as follows:

$$U|\psi\rangle|0\rangle = |\psi\rangle|\psi\rangle \tag{3.158}$$

$$= (\alpha|0\rangle + \beta|1\rangle)(\alpha|0\rangle + \beta|1\rangle) \tag{3.159}$$

$$= \alpha^2|0\rangle|0\rangle + \alpha\beta|0\rangle|1\rangle + \alpha\beta|1\rangle|0\rangle + \beta^2|1\rangle|1\rangle. \tag{3.160}$$

The copier is universal, meaning that it copies an arbitrary state. In particular, it also copies
the states $|0\rangle$ and $|1\rangle$:

$$U|0\rangle|0\rangle = |0\rangle|0\rangle, \tag{3.161}$$

$$U|1\rangle|0\rangle = |1\rangle|1\rangle. \tag{3.162}$$

Linearity of the quantum theory then implies that the unitary operator acts on a superposition
$\alpha|0\rangle + \beta|1\rangle$ as follows:

$$U(\alpha|0\rangle + \beta|1\rangle)|0\rangle = \alpha|0\rangle|0\rangle + \beta|1\rangle|1\rangle. \tag{3.163}$$

The result in (3.160) contradicts the result in (3.163) because these two expressions are not
equal for all $\alpha$ and $\beta$:

$$\exists\, \alpha, \beta : \ \alpha^2|0\rangle|0\rangle + \alpha\beta|0\rangle|1\rangle + \alpha\beta|1\rangle|0\rangle + \beta^2|1\rangle|1\rangle \neq \alpha|0\rangle|0\rangle + \beta|1\rangle|1\rangle. \tag{3.164}$$

Thus, linearity in the quantum theory contradicts the existence of a universal quantum
copier.
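A concrete instance may make the contradiction easier to see. The CNOT gate satisfies (3.161) and (3.162), so it is a natural candidate copier, and linearity then fixes its action on every other input as in (3.163). The sketch below, assuming Python with NumPy and using the hypothetical helper `try_copy`, compares the actual output with the ideal clone for $|0\rangle$, $|1\rangle$, and $|+\rangle$.

```python
import numpy as np

zero = np.array([1.0, 0.0])
one = np.array([0.0, 1.0])
plus = (zero + one) / np.sqrt(2)

# CNOT with the first qubit as control.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

def try_copy(psi):
    """Return (actual output of CNOT on |psi>|0>, ideal clone |psi>|psi>)."""
    return CNOT @ np.kron(psi, zero), np.kron(psi, psi)

for name, psi in [("|0>", zero), ("|1>", one), ("|+>", plus)]:
    actual, ideal = try_copy(psi)
    print(name, "copied:", np.allclose(actual, ideal))
```

The two basis states are copied, while the input $|+\rangle$ produces $(|00\rangle + |11\rangle)/\sqrt{2}$ rather than $|+\rangle|+\rangle$, which is the mismatch expressed by (3.164).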

We would like to stress that this proof does not mean that it is impossible to copy certain
quantum states; it only implies the impossibility of a universal copier. Another proof of
the no-cloning theorem gives insight into the type of states that we can copy. Let us again
suppose that a universal copier $U$ exists. Consider two arbitrary states $|\psi\rangle$ and $|\phi\rangle$. If a
universal copier $U$ exists, then it performs the following copying operation for both states:

$$U|\psi\rangle|0\rangle = |\psi\rangle|\psi\rangle, \tag{3.165}$$

$$U|\phi\rangle|0\rangle = |\phi\rangle|\phi\rangle. \tag{3.166}$$

Consider the probability amplitude $\langle\psi|\langle\psi||\phi\rangle|\phi\rangle$:

$$\langle\psi|\langle\psi||\phi\rangle|\phi\rangle = \langle\psi|\phi\rangle\langle\psi|\phi\rangle = \langle\psi|\phi\rangle^2. \tag{3.167}$$

The following relation for $\langle\psi|\langle\psi||\phi\rangle|\phi\rangle$ holds as well by using the results in (3.165) and (3.166) and the
unitarity property $U^\dagger U = I$:

$$\langle\psi|\langle\psi||\phi\rangle|\phi\rangle = \langle\psi|\langle 0|U^\dagger U|\phi\rangle|0\rangle \tag{3.168}$$

$$= \langle\psi|\langle 0||\phi\rangle|0\rangle \tag{3.169}$$

$$= \langle\psi|\phi\rangle\langle 0|0\rangle \tag{3.170}$$

$$= \langle\psi|\phi\rangle. \tag{3.171}$$

It then holds that

$$\langle\psi|\langle\psi||\phi\rangle|\phi\rangle = \langle\psi|\phi\rangle^2 = \langle\psi|\phi\rangle, \tag{3.172}$$

by employing the above two results. The relation $\langle\psi|\phi\rangle^2 = \langle\psi|\phi\rangle$ holds for exactly two
cases, $\langle\psi|\phi\rangle = 1$ and $\langle\psi|\phi\rangle = 0$. The first case holds only when the two states are the same
state and the second case holds when the two states are orthogonal to each other. Thus, it is
impossible to copy quantum information in any other case because we would again contradict
unitarity.

The no-cloning theorem has several applications in quantum information processing.
First, it underlies the security of the quantum key distribution protocol because it ensures
that an attacker cannot copy the quantum states that two parties use to establish a secret
key. It finds application in quantum Shannon theory because we can use it to reason about
the quantum capacity of a certain quantum channel known as the erasure channel. We will
return to this point in Chapter 23.



Exercise 3.5.6 Suppose that two states $|\psi\rangle$ and $|\psi^\perp\rangle$ are orthogonal:

$$\langle\psi|\psi^\perp\rangle = 0. \tag{3.173}$$

Construct a two-qubit unitary that can copy the states, i.e., find a unitary $U$ that acts as
follows:

$$U|\psi\rangle|0\rangle = |\psi\rangle|\psi\rangle, \tag{3.174}$$

$$U|\psi^\perp\rangle|0\rangle = |\psi^\perp\rangle|\psi^\perp\rangle. \tag{3.175}$$

Exercise 3.5.7 (No-deletion theorem) Related to the no-cloning theorem, there is a
no-deletion theorem. Suppose that two copies of a quantum state $|\psi\rangle$ are available, and the
goal is to delete one of these states by a unitary interaction. That is, there should exist a
universal quantum deleter $U$ that has the following action on the two copies of $|\psi\rangle$ and an
ancilla state $|A\rangle$, regardless of the input state $|\psi\rangle$:

$$U|\psi\rangle|\psi\rangle|A\rangle = |\psi\rangle|0\rangle|A'\rangle. \tag{3.176}$$

Show that this is impossible.

3.5.5 Measurement of Composite Systems

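The measurement postulate also extends to composite quantum systems. Suppose again
that we have the two-qubit quantum state $|\xi\rangle$ in (3.126). By a straightforward analogy with the
single-qubit case, we can determine the following amplitudes:

$$\langle 00|\xi\rangle = \alpha, \quad \langle 01|\xi\rangle = \beta, \quad \langle 10|\xi\rangle = \gamma, \quad \langle 11|\xi\rangle = \delta. \tag{3.177}$$

We can also define the following projection operators:

$$\Pi_{00} \equiv |00\rangle\langle 00|, \quad \Pi_{01} \equiv |01\rangle\langle 01|, \quad \Pi_{10} \equiv |10\rangle\langle 10|, \quad \Pi_{11} \equiv |11\rangle\langle 11|, \tag{3.178}$$

and apply the Born rule to determine the probabilities for each result:

$$\langle\xi|\Pi_{00}|\xi\rangle = |\alpha|^2, \quad \langle\xi|\Pi_{01}|\xi\rangle = |\beta|^2, \quad \langle\xi|\Pi_{10}|\xi\rangle = |\gamma|^2, \quad \langle\xi|\Pi_{11}|\xi\rangle = |\delta|^2. \tag{3.179}$$

Suppose that we wish to perform a measurement of the $Z$ operator on the first qubit
only. What is the set of projection operators that describes this measurement? The answer
is similar to what we found for the evolution of a composite system. We apply the identity
operator to the second qubit because no measurement occurs on it. Thus, the set of
measurement operators is

$$\{\Pi_0 \otimes I, \ \Pi_1 \otimes I\}, \tag{3.180}$$

where the definitions of $\Pi_0$ and $\Pi_1$ are in (3.94)-(3.95). The state collapses to

$$\frac{(\Pi_0 \otimes I)|\xi\rangle}{\sqrt{\langle\xi|(\Pi_0 \otimes I)|\xi\rangle}} = \frac{\alpha|00\rangle + \beta|01\rangle}{\sqrt{|\alpha|^2 + |\beta|^2}} \tag{3.181}$$

with probability $\langle\xi|(\Pi_0 \otimes I)|\xi\rangle = |\alpha|^2 + |\beta|^2$, and collapses to

$$\frac{(\Pi_1 \otimes I)|\xi\rangle}{\sqrt{\langle\xi|(\Pi_1 \otimes I)|\xi\rangle}} = \frac{\gamma|10\rangle + \delta|11\rangle}{\sqrt{|\gamma|^2 + |\delta|^2}} \tag{3.182}$$

with probability $\langle\xi|(\Pi_1 \otimes I)|\xi\rangle = |\gamma|^2 + |\delta|^2$. The divisions by $\sqrt{\langle\xi|(\Pi_0 \otimes I)|\xi\rangle}$ and $\sqrt{\langle\xi|(\Pi_1 \otimes I)|\xi\rangle}$
again ensure that the resulting state is a normalized, physical state.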
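The partial measurement described by (3.180) through (3.182) can be written out in a few lines. The sketch below assumes Python with NumPy; the example amplitudes and the helper `measure_first_qubit` are hypothetical and chosen only so that every outcome has nonzero probability.

```python
import numpy as np

# Example amplitudes (alpha, beta, gamma, delta); any normalized choice works.
xi = np.array([0.5, 0.5j, -0.5, 0.5])
assert np.isclose(np.linalg.norm(xi), 1.0)

I2 = np.eye(2)
Pi0 = np.diag([1.0, 0.0])   # |0><0|
Pi1 = np.diag([0.0, 1.0])   # |1><1|

def measure_first_qubit(state, projector):
    """Return (probability, post-measurement state) for projector (x) I."""
    M = np.kron(projector, I2)
    unnormalized = M @ state
    prob = np.vdot(state, unnormalized).real   # <xi| (Pi (x) I) |xi>
    return prob, unnormalized / np.sqrt(prob)

p0, post0 = measure_first_qubit(xi, Pi0)
p1, post1 = measure_first_qubit(xi, Pi1)
print(p0, post0)
print(p1, post1)
```

The helper divides by the square root of the outcome probability, so it should only be called for outcomes that can actually occur.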

3.5.6 Entanglement 

Composite quantum systems give rise to the most uniquely quantum phenomenon: entanglement.
Schrödinger first observed that two or more quantum systems can be entangled and
coined the term after noticing some of the bizarre consequences of this phenomenon.$^6$

6 Schrödinger actually used the German word "Verschränkung" to describe the phenomenon, which
literally translates as "little parts that, though far from one another, always keep the exact same distance from
each other." The one-word English translation is "entanglement." Einstein described the "Verschränkung" as
a "spukhafte Fernwirkung," most closely translated as "long-distance ghostly effect" or the more commonly
stated "spooky action at a distance."

We first consider a simple, unentangled state that two parties, Alice and Bob, may share,
in order to see how an unentangled state contrasts with an entangled state. Suppose that
they share the state

$$|0\rangle^{A}|0\rangle^{B}, \tag{3.183}$$

where Alice has the qubit in system $A$ and Bob has the qubit in system $B$. Alice can
definitely say that her qubit is in the state $|0\rangle^{A}$ and Bob can definitely say that his qubit is
in the state $|0\rangle^{B}$. There is nothing really too strange about this scenario.

Now, consider the composite quantum state $|\Phi^+\rangle^{AB}$:

$$|\Phi^+\rangle^{AB} \equiv \frac{|0\rangle^{A}|0\rangle^{B} + |1\rangle^{A}|1\rangle^{B}}{\sqrt{2}}. \tag{3.184}$$

Alice again has possession of the first qubit in system $A$ and Bob has possession of the second
qubit in system $B$. But now, it is not clear from the above description how to determine the
individual state of Alice or the individual state of Bob. The above state is really a uniform
superposition of the joint state $|0\rangle^{A}|0\rangle^{B}$ and the joint state $|1\rangle^{A}|1\rangle^{B}$, and it is not possible
to describe either Alice's or Bob's individual state in the noiseless quantum theory. We also
cannot describe the entangled state $|\Phi^+\rangle^{AB}$ as a product state of the form $|\phi\rangle^{A}|\psi\rangle^{B}$. This
criterion is one possible criterion to determine if a state is entangled, but we will see a more
formal criterion in the coming section.
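The product-state criterion just mentioned can be tested numerically for two-qubit pure states. The sketch below, assuming Python with NumPy, uses a standard linear-algebra fact: if the four amplitudes are arranged in a $2 \times 2$ matrix, the state is a product state exactly when that matrix has rank one. The function name `is_product_state` is hypothetical, and this test is not necessarily the formal criterion that the text develops later.

```python
import numpy as np

def is_product_state(state, tol=1e-12):
    """Check whether a two-qubit pure state factors as |phi>|psi>.

    The amplitudes are reshaped into a 2x2 matrix C with C[j, k] the
    amplitude of |j>|k>. A product state gives the outer product
    C = phi psi^T, which has rank one.
    """
    C = np.asarray(state, dtype=complex).reshape(2, 2)
    return np.linalg.matrix_rank(C, tol=tol) == 1

product = np.kron([1.0, 0.0], [1.0, 0.0])            # |0>|0>, as in (3.183)
ebit = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)   # |Phi+>, as in (3.184)

print(is_product_state(product))
print(is_product_state(ebit))
```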

Exercise 3.5.8 Show that the entangled state $|\Phi^+\rangle^{AB}$ has the following representation in
the $+/-$ basis:

$$|\Phi^+\rangle^{AB} = \frac{|+\rangle^{A}|+\rangle^{B} + |-\rangle^{A}|-\rangle^{B}}{\sqrt{2}}. \tag{3.185}$$

Figure 3.8 gives a graphical depiction of entanglement. We use this depiction often
throughout this book. Alice and Bob must receive the entanglement in some way, and the
diagram indicates that some source distributes the entangled pair to them. It indicates that
Alice and Bob are spatially separated and they possess the entangled state after some time.
If they share the entangled state in (3.184), we say that they share one bit of entanglement,
or one ebit. The term "ebit" implies that there is some way to quantify entanglement and
we will make this notion clear in Chapter 18.



Entanglement as a Resource 

In this book, we are interested in the use of entanglement as a resource. Much of this book 
concerns the theory of quantum information processing resources and we have a standard 
notation for the theory of resources. Let us represent the resource of a shared ebit as 

[qq], (3.186) 

meaning that the ebit is a noiseless, quantum resource shared between two parties. Square
brackets indicate a noiseless resource, the letter q indicates a quantum resource, and the two
copies of the letter q indicate a two-party resource.

Figure 3.8: We use the above diagram to depict entanglement shared between two parties A and B.
The diagram indicates that a source location creates the entanglement and distributes one system (the red
system) to A and the other system (the blue system) to B. The standard unit of entanglement is the ebit
in a Bell state $|\Phi^+\rangle^{AB} \equiv (|00\rangle^{AB} + |11\rangle^{AB})/\sqrt{2}$.

Our first example of the use of entanglement is its role in generating common randomness.
We define one bit of common randomness as the following probability distribution for two
binary random variables $X_A$ and $X_B$:

$$p_{X_A, X_B}(x_A, x_B) = \frac{1}{2}\,\delta(x_A, x_B), \tag{3.187}$$

where $\delta$ is the Kronecker delta function. Suppose Alice possesses random variable $X_A$ and
Bob possesses random variable $X_B$. Thus, with probability 1/2, they either both have a zero
or they both have a one. We represent the resource of one bit of common randomness as

$$[cc], \tag{3.188}$$

indicating that a bit of common randomness is a noiseless, classical resource shared between
two parties.

Now suppose that Alice and Bob share an ebit and they decide that they will each
measure their qubits in the computational basis. Without loss of generality, suppose that
Alice performs a measurement first. Thus, Alice performs a measurement of the $Z^A$ operator,
meaning that she measures $Z^A \otimes I^B$ (she cannot perform anything on Bob's qubit because
they are spatially separated). The projection operators for this measurement are the same
as those in (3.180), and just before Alice looks at her measurement result, we can describe the
system as being in the following ensemble of states:

$$|0\rangle^{A}|0\rangle^{B} \quad \text{with probability } \tfrac{1}{2}, \tag{3.189}$$

$$|1\rangle^{A}|1\rangle^{B} \quad \text{with probability } \tfrac{1}{2}. \tag{3.190}$$

The interesting thing about the above ensemble is that Bob's result is already determined
even before he measures, just after Alice looks at her result. Suppose that Alice knows the
result of her measurement is $|0\rangle^{A}$. When Bob measures his system, he obtains the state $|0\rangle^{B}$
with probability one and Alice knows that he has measured this result. Additionally, Bob
knows that Alice's state is $|0\rangle^{A}$ if he obtains $|0\rangle^{B}$. The same results hold if Alice knows that
the result of her measurement is $|1\rangle^{A}$. Thus, this protocol is a method for them to generate
one bit of common randomness as defined in (3.187).
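The statistics of this protocol can be simulated by sampling from the Born-rule distribution of the ebit. The sketch below assumes Python with NumPy; it samples the joint outcome of both computational-basis measurements in one step, which gives the same distribution as Alice measuring first and Bob second. The variable names and the sample size are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

ebit = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)   # (|00> + |11>)/sqrt(2)

# Born rule for a joint computational-basis measurement: outcome index
# 2*x_A + x_B occurs with probability |amplitude|^2.
probs = np.abs(ebit) ** 2

outcomes = rng.choice(4, size=1000, p=probs)
x_A = outcomes // 2   # Alice's bit
x_B = outcomes % 2    # Bob's bit

print("bits always agree:", bool(np.all(x_A == x_B)))
print("fraction of zeros:", float(np.mean(x_A == 0)))
```

The two bit strings agree in every sample and each value appears about half the time, which is the distribution in (3.187).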



We can phrase the above protocol as the following resource inequality:

$$[qq] \geq [cc]. \tag{3.191}$$

The interpretation of the above resource inequality is that there exists a protocol which 
generates the resource on the right by consuming the resource on the left, and for this reason, 
the resource on the left is a stronger resource than the one on the right. The theory of resource 
inequalities plays a prominent role in this book and is a useful shorthand for expressing 
quantum protocols. 

A natural question is whether there exists a protocol to generate entanglement from
common randomness. It is not possible to do so and the reason for this inequivalence of
resources is another type of inequality (different from the resource inequality mentioned
above), called a Bell inequality. In short, Bell's theorem places an upper bound on the
correlations present in any two classical systems. Entanglement violates this inequality,
showing that it has no known classical equivalent. Thus, entanglement is a strictly stronger
resource than common randomness and the resource inequality in (3.191) only holds in the
given direction.

Common randomness is a resource in classical information theory, and may be useful
in some scenarios, but it is actually a rather weak resource. Surely, generating common
randomness is not the only use of entanglement. It turns out that we can construct far more
exotic protocols such as the teleportation protocol or the super-dense coding protocol by
combining the resource of entanglement with other resources. We discuss these protocols in
Chapter 6.



Exercise 3.5.9 Use the representation of the ebit in Exercise 3.5.8 to show that Alice and
Bob can measure the $X$ operator to generate common randomness. This ability to obtain
common randomness by both parties measuring in either the $Z$ or $X$ basis is the basis for
an entanglement-based secret key distribution protocol.

Exercise 3.5.10 (Cloning implies signaling) Prove that if a universal quantum cloner
were to exist, then it would be possible for Alice to signal to Bob faster than the speed of
light by exploiting only the ebit state $|\Phi^+\rangle^{AB}$ shared between them and no communication.
That is, show the existence of a protocol that would allow for this. (Hint: One possibility
is for Alice to measure the $X$ or $Z$ Pauli operator locally on her share of the ebit, and then
for Bob to exploit the universal quantum cloner. Consider the representation of the ebit in
(3.184) and (3.185).)



Entanglement in the CHSH Game 

One of the simplest means for demonstrating the power of entanglement is with a two-player
game known as the CHSH game (after Clauser, Horne, Shimony, and Holt). We first present
the rules of the game, and then we find an upper bound on the probability that players
sharing classical correlations can win. We finally leave it as an exercise to show that players
sharing a maximally entangled Bell state $|\Phi^+\rangle$ can have an approximately 10% higher chance
of winning the game with a quantum strategy.

Figure 3.9: A depiction of the CHSH game. The referee distributes the bits $x$ and $y$ to Alice and Bob in
the first round. In the second round, Alice and Bob return the bits $a$ and $b$ to the referee.

The players of the game are Alice and Bob. The game begins with a referee selecting two
bits $x$ and $y$ uniformly at random. He then sends $x$ to Alice and $y$ to Bob. Alice and Bob
are not allowed to communicate in any way at this point. Alice sends back to the referee a
bit $a$, and Bob sends back a bit $b$. Since they are spatially separated, Alice's bit $a$ can only
depend on $x$, and similarly, Bob's bit $b$ can only depend on $y$. The referee then determines
if the AND of $x$ and $y$ is equal to the exclusive OR of $a$ and $b$. If so, then Alice and Bob
win the game. That is, the winning condition is

$$x \wedge y = a \oplus b. \tag{3.192}$$

Figure 3.9 depicts the CHSH game.

Before the game begins, Alice and Bob are allowed to coordinate on a strategy. A
deterministic strategy would have Alice select a bit $a_x$ conditioned on the bit $x$ that she
receives, and similarly, Bob would select a bit $b_y$ conditioned on $y$. The following table
presents the winning conditions for the four different values of $x$ and $y$ with this deterministic
strategy:

$$\begin{array}{cc|c|c}
x & y & x \wedge y & = a_x \oplus b_y \\ \hline
0 & 0 & 0 & = a_0 \oplus b_0 \\
0 & 1 & 0 & = a_0 \oplus b_1 \\
1 & 0 & 0 & = a_1 \oplus b_0 \\
1 & 1 & 1 & = a_1 \oplus b_1
\end{array} \tag{3.193}$$

However, we can observe that it is impossible for them to always win. If we add the entries
in the column $x \wedge y$, the binary sum is equal to one, while if we add the entries in the column
$= a_x \oplus b_y$, the binary sum is equal to zero because each of the bits $a_0$, $a_1$, $b_0$, and $b_1$ appears exactly twice. Thus, it is impossible for all of these equations
to be satisfied. At most, only three out of four of them can be satisfied, so that the maximal
winning probability with a classical deterministic strategy is at most 3/4. This upper bound
also serves as an upper bound on the winning probability for the case in which they employ
a randomized strategy coordinated by shared randomness, because any such strategy would just be
a convex combination of deterministic strategies. We can then see that a strategy for them
to achieve this upper bound is for Alice and Bob always to return $a = 0$ and $b = 0$ no matter
the values of $x$ and $y$.
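Because there are only sixteen deterministic strategies, the classical bound can also be confirmed by exhaustive search. The following sketch uses only the Python standard library; the function name `win_probability` and the variables `best` and `always_zero` are hypothetical.

```python
from itertools import product

def win_probability(a0, a1, b0, b1):
    """Winning probability of the deterministic strategy a_x, b_y
    when x and y are uniformly random bits."""
    a = (a0, a1)
    b = (b0, b1)
    wins = sum((x & y) == (a[x] ^ b[y]) for x, y in product((0, 1), repeat=2))
    return wins / 4

# Enumerate all 16 deterministic strategies (a_0, a_1, b_0, b_1).
best = max(win_probability(*s) for s in product((0, 1), repeat=4))
always_zero = win_probability(0, 0, 0, 0)

print(best, always_zero)
```

Both values come out to 3/4, in agreement with the parity argument and with the all-zeros strategy described above.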

Interestingly, if Alice and Bob share a maximally entangled state, they can achieve a
higher winning probability than if they share classical correlations only. This is one
demonstration of the power of entanglement, and we leave it as an exercise to prove that the
following quantum strategy achieves a winning probability of $\cos^2(\pi/8) \approx 0.85$ in the CHSH
game.

Exercise 3.5.11 Suppose that Alice and Bob share a maximally entangled state $|\Phi^+\rangle$. Show
that the following strategy has a winning probability of $\cos^2(\pi/8)$. If Alice receives $x = 0$
from the referee, then she performs a measurement of Pauli $Z$ on her system and returns the
outcome as $a$. If she receives $x = 1$, then she performs a measurement of Pauli $X$ and returns
the outcome as $a$. If Bob receives $y = 0$ from the referee, then he performs a measurement
of $(X + Z)/\sqrt{2}$ on his system and returns the outcome as $b$. If Bob receives $y = 1$ from the
referee, then he performs a measurement of $(Z - X)/\sqrt{2}$ and returns the outcome as $b$.
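The claimed winning probability can be checked numerically before proving it. The sketch below assumes Python with NumPy and one convention that the exercise leaves implicit: the eigenvalue $+1$ of each observable is reported as bit 0 and the eigenvalue $-1$ as bit 1. Under that assumption the two bits agree with probability $(1 + \langle A \otimes B \rangle)/2$. The function name `prob_equal_bits` is hypothetical.

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])
phi_plus = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)

alice = {0: Z, 1: X}
bob = {0: (X + Z) / np.sqrt(2), 1: (Z - X) / np.sqrt(2)}

def prob_equal_bits(A, B, state):
    """Probability that the two measured bits agree (a XOR b = 0).

    With eigenvalue +1 read as bit 0 and -1 as bit 1 (an assumed
    convention), the product of the eigenvalues is +1 exactly when the
    bits agree, so this probability is (1 + <A (x) B>) / 2.
    """
    correlation = np.vdot(state, np.kron(A, B) @ state).real
    return (1 + correlation) / 2

win = 0.0
for x in (0, 1):
    for y in (0, 1):
        p_equal = prob_equal_bits(alice[x], bob[y], phi_plus)
        # Winning needs a XOR b = x AND y: equal bits unless x = y = 1.
        win += 0.25 * ((1 - p_equal) if (x & y) else p_equal)

print(win, np.cos(np.pi / 8) ** 2)
```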

The Bell States 

There are other useful entangled states besides the standard ebit. Suppose that Alice
performs a $Z^A$ operation on her half of the ebit $|\Phi^+\rangle^{AB}$. Then the resulting state is

$$|\Phi^-\rangle^{AB} \equiv \frac{1}{\sqrt{2}}\left(|00\rangle^{AB} - |11\rangle^{AB}\right). \tag{3.194}$$

Similarly, if Alice performs an $X$ operator or a $Y$ operator, the global state transforms to
the following respective states (up to a global phase):

$$|\Psi^+\rangle^{AB} \equiv \frac{1}{\sqrt{2}}\left(|01\rangle^{AB} + |10\rangle^{AB}\right), \tag{3.195}$$

$$|\Psi^-\rangle^{AB} \equiv \frac{1}{\sqrt{2}}\left(|01\rangle^{AB} - |10\rangle^{AB}\right). \tag{3.196}$$

The states $|\Phi^+\rangle^{AB}$, $|\Phi^-\rangle^{AB}$, $|\Psi^+\rangle^{AB}$, and $|\Psi^-\rangle^{AB}$ are known as the Bell states and are the
most important entangled states for a two-qubit system. They form an orthonormal basis,
called the Bell basis, for a two-qubit space. We can also label the Bell states as

$$|\Phi_{zx}\rangle^{AB} \equiv (Z^A)^z (X^A)^x |\Phi^+\rangle^{AB}, \tag{3.197}$$

where the two-bit binary number $zx$ indicates whether Alice applies $I^A$, $Z^A$, $X^A$, or $ZX^A$.
Then the states $|\Phi_{00}\rangle^{AB}$, $|\Phi_{01}\rangle^{AB}$, $|\Phi_{10}\rangle^{AB}$, and $|\Phi_{11}\rangle^{AB}$ are in correspondence with the
respective states $|\Phi^+\rangle^{AB}$, $|\Psi^+\rangle^{AB}$, $|\Phi^-\rangle^{AB}$, and $|\Psi^-\rangle^{AB}$.
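The labeling in (3.197) is easy to turn into code. The sketch below assumes Python with NumPy; the function `bell_state` and the dictionary `bell` are hypothetical names. It generates the four states from $|\Phi^+\rangle$ by local operations on Alice's qubit and prints the matrix of their pairwise inner products.

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])
phi_plus = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)

def bell_state(z, x):
    """|Phi_zx> = (Z^A)^z (X^A)^x |Phi+>, as in (3.197)."""
    alice_op = np.linalg.matrix_power(Z, z) @ np.linalg.matrix_power(X, x)
    return np.kron(alice_op, I2) @ phi_plus

bell = {(z, x): bell_state(z, x) for z in (0, 1) for x in (0, 1)}

# Gram matrix of all pairwise inner products; the identity matrix means
# the four states are orthonormal.
vectors = np.array(list(bell.values()))
gram = vectors @ vectors.conj().T
print(np.round(gram, 12))
```

A Gram matrix equal to the identity is a numerical indication that the four states form an orthonormal basis; it does not replace an analytic argument.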



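Exercise 3.5.12 Show that the Bell states form an orthonormal basis:

$$\langle\Phi_{z_1 x_1}|\Phi_{z_2 x_2}\rangle = \delta(z_1, z_2)\,\delta(x_1, x_2). \tag{3.198}$$

Exercise 3.5.13 Show that the following identities hold:

$$|00\rangle^{AB} = \frac{1}{\sqrt{2}}\left(|\Phi^+\rangle^{AB} + |\Phi^-\rangle^{AB}\right), \tag{3.199}$$

$$|01\rangle^{AB} = \frac{1}{\sqrt{2}}\left(|\Psi^+\rangle^{AB} + |\Psi^-\rangle^{AB}\right), \tag{3.200}$$

$$|10\rangle^{AB} = \frac{1}{\sqrt{2}}\left(|\Psi^+\rangle^{AB} - |\Psi^-\rangle^{AB}\right), \tag{3.201}$$

$$|11\rangle^{AB} = \frac{1}{\sqrt{2}}\left(|\Phi^+\rangle^{AB} - |\Phi^-\rangle^{AB}\right). \tag{3.202}$$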

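Exercise 3.5.14 Show that the following identities hold by using the relation in (3.197):

$$|\Phi^+\rangle^{AB} = \frac{1}{\sqrt{2}}\left(|{+}{+}\rangle^{AB} + |{-}{-}\rangle^{AB}\right), \tag{3.203}$$

$$|\Phi^-\rangle^{AB} = \frac{1}{\sqrt{2}}\left(|{-}{+}\rangle^{AB} + |{+}{-}\rangle^{AB}\right), \tag{3.204}$$

$$|\Psi^+\rangle^{AB} = \frac{1}{\sqrt{2}}\left(|{+}{+}\rangle^{AB} - |{-}{-}\rangle^{AB}\right), \tag{3.205}$$

$$|\Psi^-\rangle^{AB} = \frac{1}{\sqrt{2}}\left(|{-}{+}\rangle^{AB} - |{+}{-}\rangle^{AB}\right). \tag{3.206}$$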


Entanglement is one of the most useful resources in quantum computing, quantum
communication, and in the setting of quantum Shannon theory that we explore in this book. Our
goal in this book is merely to study entanglement as a resource, but there are many other
aspects of entanglement that one can study, such as measures of entanglement, multiparty
entanglement, and generalized Bell's inequalities [155].



3.6 Summary and Extensions to Qudit States 

We now end our overview of the noiseless quantum theory by summarizing its main postulates
in terms of quantum states that are on $d$-dimensional systems. Such states are called qudit
states, in analogy with the name "qubit" for two-dimensional quantum systems.

3.6.1 Qudits

A qudit state $|\psi\rangle$ is an arbitrary superposition of some set of orthonormal basis states
$\{|j\rangle\}_{j\in\{0,\ldots,d-1\}}$ for a $d$-dimensional quantum system:

$$|\psi\rangle \equiv \sum_{j=0}^{d-1} \alpha_j |j\rangle. \tag{3.207}$$

The amplitudes $\alpha_j$ obey the normalization condition $\sum_{j=0}^{d-1} |\alpha_j|^2 = 1$.

3.6.2 Unitary Evolution 

The first postulate of the quantum theory is that we can perform a unitary (reversible)
evolution $U$ on this state. The resulting state is

$$U|\psi\rangle, \tag{3.208}$$

meaning that we apply the operator $U$ to the state $|\psi\rangle$.

One example of a unitary evolution is the cyclic shift operator $X(x)$ that acts on the
orthonormal states $\{|j\rangle\}_{j\in\{0,\ldots,d-1\}}$ as follows:

$$X(x)|j\rangle = |x \oplus j\rangle, \tag{3.209}$$

where $\oplus$ is a cyclic addition operator, meaning that the result of the addition is $(x + j) \bmod d$.
Notice that the $X$ Pauli operator has a similar behavior on the qubit computational basis
states because

$$X|i\rangle = |i \oplus 1\rangle, \tag{3.210}$$

for $i \in \{0, 1\}$. Therefore, the operator $X(x)$ is one qudit analog of the $X$ Pauli operator.

Exercise 3.6.1 Show that the inverse of $X(x)$ is $X(-x)$.

Exercise 3.6.2 Show that the matrix representation $X(x)$ of the $X(x)$ operator is a matrix
with elements

$$[X(x)]_{i,j} = \delta_{i,\, x \oplus j}. \tag{3.211}$$

Another example of a unitary evolution is the phase operator $Z(z)$. It applies a
state-dependent phase to a basis state. It acts as follows on the qudit computational basis states
$\{|j\rangle\}_{j\in\{0,\ldots,d-1\}}$:

$$Z(z)|j\rangle = \exp\{i 2\pi z j/d\}\,|j\rangle. \tag{3.212}$$

This operator is the qudit analog of the Pauli $Z$ operator. The $d^2$ operators $\{X(x)Z(z)\}_{x,z\in\{0,\ldots,d-1\}}$
are known as the Heisenberg-Weyl operators.
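The definitions (3.209) and (3.212) translate directly into matrices. The sketch below assumes Python with NumPy; the function names `shift` and `phase` and the choice $d = 3$ are hypothetical. It builds all $d^2$ Heisenberg-Weyl operators, checks that each is unitary, and compares the $d = 2$ case with the Pauli matrices.

```python
import numpy as np

def shift(x, d):
    """Cyclic shift X(x): |j> -> |(x + j) mod d>, as in (3.209)."""
    M = np.zeros((d, d), dtype=complex)
    for j in range(d):
        M[(x + j) % d, j] = 1
    return M

def phase(z, d):
    """Phase operator Z(z): |j> -> exp(i 2 pi z j / d) |j>, as in (3.212)."""
    return np.diag(np.exp(2j * np.pi * z * np.arange(d) / d))

d = 3
ops = [shift(x, d) @ phase(z, d) for x in range(d) for z in range(d)]

# Every Heisenberg-Weyl operator should be unitary.
all_unitary = all(np.allclose(U.conj().T @ U, np.eye(d)) for U in ops)

# For d = 2, X(1) and Z(1) reduce to the Pauli X and Z matrices.
pauli_x_ok = np.allclose(shift(1, 2), [[0, 1], [1, 0]])
pauli_z_ok = np.allclose(phase(1, 2), [[1, 0], [0, -1]])

print(len(ops), all_unitary, pauli_x_ok, pauli_z_ok)
```

The same two functions can be reused to spot-check hand calculations involving products of shift and phase operators for small $d$.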

Exercise 3.6.3 Show that $Z(1)$ is equivalent to the Pauli $Z$ operator for the case that the
dimension $d = 2$.

Exercise 3.6.4 Show that the inverse of $Z(z)$ is $Z(-z)$.

Exercise 3.6.5 Show that the matrix representation of the phase operator $Z(z)$ is

$$[Z(z)]_{j,k} = \exp\{i 2\pi z j/d\}\,\delta_{j,k}. \tag{3.213}$$

In particular, this result implies that the $Z(z)$ operator has a diagonal matrix representation
with respect to the qudit computational basis states $\{|j\rangle\}_{j\in\{0,\ldots,d-1\}}$. Thus, the qudit
computational basis states $\{|j\rangle\}_{j\in\{0,\ldots,d-1\}}$ are eigenstates of the phase operator $Z(z)$ (similar to the
qubit computational basis states being eigenstates of the Pauli $Z$ operator). The eigenvalue
corresponding to the eigenstate $|j\rangle$ is $\exp\{i 2\pi z j/d\}$.

Exercise 3.6.6 Show that the eigenstates $|l\rangle_X$ of the cyclic shift operator $X(1)$ are the
Fourier-transformed states $|l\rangle_X$ where

$$|l\rangle_X \equiv \frac{1}{\sqrt{d}} \sum_{j=0}^{d-1} \exp\{i 2\pi l j/d\}\,|j\rangle, \tag{3.214}$$

$l$ is an integer in the set $\{0, \ldots, d-1\}$, and the subscript $X$ for the state $|l\rangle_X$ indicates
that it is an $X$ eigenstate. Show that the eigenvalue corresponding to the state $|l\rangle_X$ is
$\exp\{-i 2\pi l/d\}$. Conclude that these states are also eigenstates of the operator $X(x)$, but the
corresponding eigenvalues are $\exp\{-i 2\pi l x/d\}$.



Exercise 3.6.7 Show that the $+/-$ basis states are a special case of the states in (3.214)
when $d = 2$.

Exercise 3.6.8 The Fourier transform operator $F$ is the qudit analog of the Hadamard $H$.
We define it to take $Z$ eigenstates to $X$ eigenstates:

$$F \equiv \sum_{j=0}^{d-1} |j\rangle_X \langle j|_Z, \tag{3.215}$$

where the subscript $Z$ indicates a $Z$ eigenstate. It performs the following transformation on
the qudit computational basis states:

$$|j\rangle \to \frac{1}{\sqrt{d}} \sum_{k=0}^{d-1} \exp\{i 2\pi j k/d\}\,|k\rangle. \tag{3.216}$$

Show that the following relations hold for the Fourier transform operator $F$:

$$F X(x) F^\dagger = Z(x), \tag{3.217}$$

$$F Z(z) F^\dagger = X(-z). \tag{3.218}$$

Exercise 3.6.9 Show that the commutation relations of the cyclic shift operator $X(x)$ and
the phase operator $Z(z)$ are as follows:

$$X(x_1)Z(z_1)X(x_2)Z(z_2) = \exp\{2\pi i (z_1 x_2 - x_1 z_2)/d\}\, X(x_2)Z(z_2)X(x_1)Z(z_1). \tag{3.219}$$

You can get this result by first showing that

$$X(x)Z(z) = \exp\{-2\pi i z x/d\}\, Z(z)X(x). \tag{3.220}$$

3.6.3 Measurement of Qudits 

Measurement of qudits is similar to measurement of qubits. Suppose that we have some
state $|\psi\rangle$. Suppose further that we would like to measure some Hermitian operator $A$ with
the following diagonalization:

$$A = \sum_j f(j)\,\Pi_j, \tag{3.221}$$

where $\Pi_j \Pi_k = \Pi_j \delta_{j,k}$, and $\sum_j \Pi_j = I$. A measurement of the operator $A$ then returns the
result $j$ with the following probability:

$$p(j) = \langle\psi|\Pi_j|\psi\rangle, \tag{3.222}$$

and the resulting state is

$$\frac{\Pi_j|\psi\rangle}{\sqrt{p(j)}}. \tag{3.223}$$

The calculation of the expectation of the operator $A$ is similar to how we calculate in the
qubit case:

$$\mathbb{E}[A] = \sum_j f(j)\,\langle\psi|\Pi_j|\psi\rangle \tag{3.224}$$

$$= \langle\psi|\sum_j f(j)\,\Pi_j|\psi\rangle \tag{3.225}$$

$$= \langle\psi|A|\psi\rangle. \tag{3.226}$$

We give two quick examples of qudit operators that we might like to measure. The
operators $X(1)$ and $Z(1)$ are not completely analogous to the respective Pauli $X$ and Pauli
$Z$ operators because $X(1)$ and $Z(1)$ are not Hermitian. Thus, we cannot directly measure
these operators. Instead, we construct operators that are essentially equivalent to "measuring
the operators" $X(1)$ and $Z(1)$. Let us first consider the $Z(1)$ operator. Its eigenstates are
the qudit computational basis states $\{|j\rangle\}_{j\in\{0,\ldots,d-1\}}$. We can form the operator $M_{Z(1)}$ as

$$M_{Z(1)} \equiv \sum_{j=0}^{d-1} j\,|j\rangle\langle j|. \tag{3.227}$$

Measuring this operator is equivalent to measuring in the qudit computational basis. The
expectation of this operator for a qudit $|\psi\rangle$ in the state in (3.207) is

$$\mathbb{E}[M_{Z(1)}] = \langle\psi|M_{Z(1)}|\psi\rangle \tag{3.228}$$

$$= \sum_{j'=0}^{d-1} \alpha_{j'}^{*} \langle j'| \ \sum_{j=0}^{d-1} j\,|j\rangle\langle j| \ \sum_{j''=0}^{d-1} \alpha_{j''} |j''\rangle \tag{3.229}$$

$$= \sum_{j', j, j''=0}^{d-1} j\, \alpha_{j'}^{*} \alpha_{j''} \langle j'|j\rangle\langle j|j''\rangle \tag{3.230}$$

$$= \sum_{j=0}^{d-1} j\, |\alpha_j|^2. \tag{3.231}$$


Similarly, we can construct an operator $M_{X(1)}$ for "measuring the operator $X(1)$" by using
the eigenstates $|j\rangle_X$ of the $X(1)$ operator:

$$M_{X(1)} \equiv \sum_{j=0}^{d-1} j\,|j\rangle_X\langle j|_X. \tag{3.232}$$

We leave it as an exercise to determine the expectation when measuring the $M_{X(1)}$ operator.

Exercise 3.6.10 Suppose the qudit is in the state $|\psi\rangle$ in (3.207). Show that the expectation
of the $M_{X(1)}$ operator is

$$\mathbb{E}[M_{X(1)}] = \frac{1}{d} \sum_{j=0}^{d-1} j \left| \sum_{j'=0}^{d-1} \alpha_{j'} \exp\{-i 2\pi j' j/d\} \right|^2. \tag{3.233}$$

Hint: First show that we can represent the state $|\psi\rangle$ in the $X(1)$ eigenbasis as follows:

$$|\psi\rangle = \sum_{l=0}^{d-1} \frac{1}{\sqrt{d}} \left( \sum_{j=0}^{d-1} \alpha_j \exp\{-i 2\pi l j/d\} \right) |l\rangle_X. \tag{3.234}$$



3.6.4 Composite Systems of Qudits 

We can define a system of multiple qudits again by employing the tensor product. A general
two-qudit state on systems $A$ and $B$ has the following form:

    |\xi\rangle^{AB} \equiv \sum_{j,k=0}^{d-1} \alpha_{j,k} |j\rangle^A |k\rangle^B. (3.235)

Evolution of two-qudit states is similar to before. Suppose Alice applies a unitary $U^A$ to her
qudit. The result is as follows:

    (U^A \otimes I^B) |\xi\rangle^{AB} = (U^A \otimes I^B) \sum_{j,k=0}^{d-1} \alpha_{j,k} |j\rangle^A |k\rangle^B (3.236)
      = \sum_{j,k=0}^{d-1} \alpha_{j,k} (U^A |j\rangle^A) |k\rangle^B, (3.237)

which follows by linearity. Bob applying a local unitary $U^B$ has a similar form. The
application of some global unitary $U^{AB}$ is as follows:

    U^{AB} |\xi\rangle^{AB}. (3.238)

The Qudit Bell States 

Two-qudit states can be entangled as well. The maximally entangled qudit state is as follows:

    |\Phi\rangle^{AB} \equiv \frac{1}{\sqrt{d}} \sum_{i=0}^{d-1} |i\rangle^A |i\rangle^B. (3.239)

When Alice possesses the first qudit and Bob possesses the second qudit and they are also
separated in space, the above state is a resource known as an edit (pronounced "ee · dit").
It is useful in the qudit versions of the teleportation protocol and the super-dense coding
protocol discussed in Chapter 6.

Consider applying the operator $X(x)Z(z)$ to Alice's side of the maximally entangled state
$|\Phi\rangle^{AB}$. We use the following notation:

    |\Phi_{x,z}\rangle^{AB} \equiv (X^A(x) Z^A(z) \otimes I^B) |\Phi\rangle^{AB}. (3.240)

The $d^2$ states $\{|\Phi_{x,z}\rangle^{AB}\}_{x,z=0}^{d-1}$ are known as the qudit Bell states and are important in qudit
quantum protocols and in quantum Shannon theory. Exercise 3.6.11 asks you to verify that
these states form a complete, orthonormal basis. Thus, one can measure two qudits in the
qudit Bell basis.
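
As a sanity check on the claim that the qudit Bell states form an orthonormal basis, the following sketch builds all nine states for d = 3 and tests their pairwise inner products. It assumes the conventions X(x)|i> = |i + x mod d> and Z(z)|i> = exp{i 2 pi z i/d}|i>; the function names are hypothetical and not part of the text.

```python
import cmath
import math

def bell_state(d, x, z):
    """Amplitudes of |Phi_{x,z}> on the basis |a>|b>, stored at index a*d + b.

    Under the assumed conventions, (X(x)Z(z) (x) I)|Phi> equals
    d^{-1/2} * sum_i exp(2*pi*i*z*i/d) |i + x mod d>|i>.
    """
    vec = [0j] * (d * d)
    for i in range(d):
        a = (i + x) % d
        vec[a * d + i] = cmath.exp(2j * cmath.pi * z * i / d) / math.sqrt(d)
    return vec

def inner(u, v):
    """Inner product <u|v>, conjugating the first argument."""
    return sum(a.conjugate() * b for a, b in zip(u, v))

d = 3
states = {(x, z): bell_state(d, x, z) for x in range(d) for z in range(d)}

# Orthonormality: <Phi_{x',z'}|Phi_{x,z}> should be 1 when the labels match, else 0.
gram_ok = all(
    abs(inner(states[p], states[q]) - (1 if p == q else 0)) < 1e-9
    for p in states for q in states
)
```

Since there are d^2 mutually orthonormal vectors in a space of dimension d^2, completeness follows from orthonormality in this finite-dimensional setting.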

Similar to the qubit case, it is straightforward to see that the qudit state can generate a
dit of common randomness by extending the arguments in Section 3.5.6. We end our review
of the noiseless quantum theory with some exercises. The transpose trick is one of our most
important tools for manipulating maximally entangled states.

Exercise 3.6.11 Show that the set of states $\{|\Phi_{x,z}\rangle^{AB}\}_{x,z=0}^{d-1}$ forms a complete, orthonormal
basis:

    \langle\Phi_{x',z'}|\Phi_{x,z}\rangle = \delta_{x,x'} \delta_{z,z'}, (3.241)

    \sum_{x,z=0}^{d-1} |\Phi_{x,z}\rangle\langle\Phi_{x,z}| = I^{AB}. (3.242)

Exercise 3.6.12 (Transpose Trick) Show that the following "transpose trick" or "ricochet"
property holds for a maximally entangled state $|\Phi\rangle^{AB}$ and any matrix $M$:

    (M^A \otimes I^B) |\Phi\rangle^{AB} = (I^A \otimes (M^T)^B) |\Phi\rangle^{AB}. (3.243)

The implication is that some local action of Alice on $|\Phi\rangle^{AB}$ is equivalent to Bob performing
the transpose of this action on his half of the state.
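
The transpose trick can be checked numerically for a particular matrix. The sketch below is an illustration only: the matrix `M` is an arbitrary non-symmetric choice made here, and the helper names are hypothetical. It compares the amplitudes of (M (x) I)|Phi> with those of (I (x) M^T)|Phi> for d = 3.

```python
import math

def apply_on_alice(M, d):
    """Amplitudes of (M (x) I)|Phi>, indexed [a][b] for the basis |a>|b>."""
    amp = [[0j] * d for _ in range(d)]
    for i in range(d):              # |Phi> = d^{-1/2} sum_i |i>|i>
        for a in range(d):
            amp[a][i] += M[a][i] / math.sqrt(d)
    return amp

def apply_on_bob(N, d):
    """Amplitudes of (I (x) N)|Phi>, indexed [a][b] for the basis |a>|b>."""
    amp = [[0j] * d for _ in range(d)]
    for i in range(d):
        for b in range(d):
            amp[i][b] += N[b][i] / math.sqrt(d)
    return amp

def transpose(M):
    return [list(col) for col in zip(*M)]

# An arbitrary, non-symmetric 3 x 3 matrix (hypothetical example data).
M = [[1, 2j, 0], [0.5, -1, 3], [1j, 4, 2 - 1j]]

lhs = apply_on_alice(M, 3)
rhs = apply_on_bob(transpose(M), 3)
trick_holds = all(abs(lhs[a][b] - rhs[a][b]) < 1e-12
                  for a in range(3) for b in range(3))
```

Applying `M` itself (rather than its transpose) on Bob's side gives a different state for this non-symmetric `M`, so the transpose is essential.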

Schmidt decomposition 

The Schmidt decomposition is one of the most important tools for analyzing bipartite pure 
states. The Schmidt decomposition shows that it is possible to decompose any pure, two- 
qudit state as a superposition of corresponding states. We state this result formally as a 
theorem. 

Theorem 3.6.1 (Schmidt decomposition). Suppose that we have a two-qudit pure state,

    |\psi\rangle^{AB}, (3.244)

where the Hilbert spaces for systems $A$ and $B$ have the same dimension $d$. Then it is possible
to express this state as follows:

    |\psi\rangle^{AB} \equiv \sum_{i=0}^{d-1} \lambda_i |i\rangle^A |i\rangle^B, (3.245)

where the amplitudes $\lambda_i$ are normalized so that $\sum_i |\lambda_i|^2 = 1$, the states $\{|i\rangle^A\}$ form an
orthonormal basis for system $A$, and the states $\{|i\rangle^B\}$ form an orthonormal basis for
system $B$. The Schmidt rank of a bipartite state is equal to the number of non-zero coefficients
$\lambda_i$ in its Schmidt decomposition.

Proof. We now prove the above theorem. Consider an arbitrary two-qudit pure state $|\psi\rangle^{AB}$.
We can express it as follows:

    |\psi\rangle^{AB} = \sum_{j,k} \alpha_{j,k} |j\rangle^A |k\rangle^B, (3.246)

for some amplitudes $\alpha_{j,k}$ and some orthonormal bases $\{|j\rangle^A\}$ and $\{|k\rangle^B\}$ on the respective
systems $A$ and $B$. Let us write the matrix formed by the coefficients $\alpha_{j,k}$ as some matrix $A$
where

    [A]_{j,k} = \alpha_{j,k}. (3.247)

This matrix $A$ admits a singular value decomposition of the form

    A = U \Lambda V, (3.248)

where the matrices $U$ and $V$ are both unitary and the matrix $\Lambda$ is diagonal. Let us write
the matrix elements of $U$ as $u_{j,i}$, those of $\Lambda$ as $\lambda_i$ (we only need one index for the elements
along the diagonal), and those of $V$ as $v_{i,k}$. The above matrix equation is then equal to the
following set of equations:

    \alpha_{j,k} = \sum_i u_{j,i} \lambda_i v_{i,k}. (3.249)

Let us make this substitution into the expression for the state in (3.246):

    |\psi\rangle^{AB} = \sum_{j,k} \left( \sum_i u_{j,i} \lambda_i v_{i,k} \right) |j\rangle^A |k\rangle^B. (3.250)

Readjusting some terms by exploiting the properties of the tensor product, we find that

    |\psi\rangle^{AB} = \sum_i \lambda_i \left( \sum_j u_{j,i} |j\rangle^A \right) \otimes \left( \sum_k v_{i,k} |k\rangle^B \right) (3.251)
      = \sum_i \lambda_i |i\rangle^A |i\rangle^B, (3.252)

where we define the orthonormal basis on the $A$ system as $|i\rangle^A \equiv \sum_j u_{j,i} |j\rangle^A$ and we define
the orthonormal basis on the $B$ system as $|i\rangle^B \equiv \sum_k v_{i,k} |k\rangle^B$. This final step completes the
proof of the theorem, but the next exercise asks you to verify that the set of states $\{|i\rangle^A\}$
forms an orthonormal basis (the proof for the set of states $\{|i\rangle^B\}$ is similar). □
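
The construction in the proof can be followed by hand for small cases. The two examples below are illustrations chosen here rather than taken from the text: a two-qubit product state with Schmidt rank one, and the ebit with Schmidt rank two. In both cases the coefficient matrix from (3.247) is written out and its singular values are read off as the Schmidt coefficients.

```latex
% Example 1 (product state): all four amplitudes equal 1/2.
|\psi\rangle^{AB} = \tfrac{1}{2}\left(|0\rangle^A|0\rangle^B + |0\rangle^A|1\rangle^B
                  + |1\rangle^A|0\rangle^B + |1\rangle^A|1\rangle^B\right),
\qquad
A = \tfrac{1}{2}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}.
% A has singular values 1 and 0, so \lambda_0 = 1, \lambda_1 = 0 and
|\psi\rangle^{AB} = 1 \cdot |+\rangle^A |+\rangle^B
\quad\Longrightarrow\quad \text{Schmidt rank } 1.

% Example 2 (ebit): the coefficient matrix is proportional to the identity.
|\Phi^+\rangle^{AB} = \tfrac{1}{\sqrt{2}}\left(|0\rangle^A|0\rangle^B + |1\rangle^A|1\rangle^B\right),
\qquad
A = \tfrac{1}{\sqrt{2}}\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.
% Both singular values equal 1/\sqrt{2}, so
\lambda_0 = \lambda_1 = \tfrac{1}{\sqrt{2}}
\quad\Longrightarrow\quad \text{Schmidt rank } 2.
```

In both examples the squared coefficients sum to one, in agreement with the normalization condition in the theorem.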

Remark 3.6.1 The Schmidt decomposition applies not only to bipartite systems but to any
number of systems where we can make a bipartite cut of the systems. For example, suppose
that there is a state $|\phi\rangle^{ABCDE}$ on systems $ABCDE$. We could say that $AB$ are part of one
system and $CDE$ are part of another system and write a Schmidt decomposition for this
state as follows:

    |\phi\rangle^{ABCDE} = \sum_y \sqrt{p_Y(y)} |y\rangle^{AB} |y\rangle^{CDE}, (3.253)

where $\{|y\rangle^{AB}\}$ is an orthonormal basis for the joint system $AB$ and $\{|y\rangle^{CDE}\}$ is an
orthonormal basis for the joint system $CDE$.

Exercise 3.6.13 Verify that the set of states $\{|i\rangle^A\}$ forms an orthonormal basis by exploiting
the unitarity of the matrix $U$.

3.7 History and Further Reading 

There are many great books on quantum mechanics that outline the mathematical
background. The books of Bohm [4T], Sakurai [209], and Nielsen and Chuang [197] are among
these. The ideas for the resource inequality formalism first appeared in a popular article of
Bennett [19] and another of his papers [20]. The no-deletion theorem is in Ref. [202]. The
review article of the Horodecki family is a useful reference on the study of entanglement [155].

The Noisy Quantum Theory



In general, we may not know for certain whether we possess a particular quantum state. 
Instead, we may only have a probabilistic description of an ensemble of quantum states. 
This chapter re-establishes the postulates of the quantum theory so that they incorporate a 
lack of complete information about a quantum system. The density operator formalism is a 
powerful mathematical tool for describing this scenario. This chapter also establishes how 
to model the noisy evolution of a quantum system, and we explore models of noisy quantum
channels that are the analogs of the noisy classical channel discussed in Section 2.2.3 of
Chapter 2.

You might have noticed that the development in the previous chapter relied on the 
premise that the possessor of a quantum system has perfect knowledge of the state of a 
given system. For instance, we assumed that Alice knows that she possesses a qubit in the
state $|\psi\rangle$ where

    |\psi\rangle = \alpha|0\rangle + \beta|1\rangle. (4.1)

Also, we assumed that Alice and Bob might know that they share an ebit $|\Phi^+\rangle$. We even
assumed perfect knowledge of a unitary evolution or a particular measurement that a possessor
of a quantum state may apply to it.

This assumption of perfect, definite knowledge of a quantum state is a difficult one to 
justify in practice. In reality, it is difficult to prepare, evolve, or measure a quantum state 
exactly as we wish. Slight errors may occur in the preparation, evolution, or measurement 
due to imprecise devices or to coupling with other degrees of freedom outside of the system 
that we are controlling. An example of such imprecision can occur in the coupling of two 
photons at a beamsplitter. We may not be able to tune the reflectivity of the beamsplitter 
exactly or may not have the timing of the arrival of the photons exactly set. The noiseless 
quantum theory as we presented it in the previous section cannot handle such imprecisions. 

In this chapter, we relax the assumption of perfect knowledge of the preparation,
evolution, or measurement of quantum states and develop a noisy quantum theory that
incorporates an imprecise knowledge of these states. The noisy quantum theory fuses probability
theory and the quantum theory into one formalism.

We proceed with the development of the noisy quantum theory in the following order:

1. We first present the density operator formalism, which gives a representation for a
noisy, imprecise quantum state.

2. We then discuss the general form of measurements and their effect on our
description of a noisy quantum state. We specifically discuss the POVM (positive
operator-valued measure) formalism that gives a more general way of describing
measurements.

3. We proceed to composite noisy systems, which admit a particular form, and we discuss 
several possible states of composite noisy systems including product states, separable 
states, classical-quantum states, entangled states, and arbitrary states. 

4. Next, we consider the Kraus representation of a noisy quantum channel, which gives a 
way to describe noisy evolution, and we discuss important examples of noisy quantum 
channels. 



4.1 Noisy Quantum States 



We generally may not have perfect knowledge of a prepared quantum state. Suppose a third
party, Bob, prepares a state for us and only gives us a probabilistic description of it. We may
only know that Bob selects the state $|\psi_x\rangle$ with a certain probability $p_X(x)$. Our description
of the state is then as an ensemble $\mathcal{E}$ of quantum states where

    \mathcal{E} \equiv \{p_X(x), |\psi_x\rangle\}_{x \in \mathcal{X}}. (4.2)

In the above, $X$ is a random variable with distribution $p_X(x)$. Each realization $x$ of the random
variable $X$ belongs to an alphabet $\mathcal{X}$. For our purposes, it is sufficient for us to say that
$\mathcal{X} \equiv \{1, \ldots, |\mathcal{X}|\}$. Thus, the realization $x$ merely acts as an index, meaning that the quantum
state is $|\psi_x\rangle$ with probability $p_X(x)$. We also assume that each state $|\psi_x\rangle$ is a qudit state
that lives on a system of dimension $d$.

A simple example is the following ensemble:

    \{\{1/3, |1\rangle\}, \{2/3, |3\rangle\}\}. (4.3)

The states $|1\rangle$ and $|3\rangle$ live on a four-dimensional system with basis states

    \{|0\rangle, |1\rangle, |2\rangle, |3\rangle\}. (4.4)

The interpretation of this ensemble is that the state is $|1\rangle$ with probability 1/3 and the state
is $|3\rangle$ with probability 2/3.

4.1.1 The Density Operator

Suppose now that we have the ability to perform a perfect measurement of a system with
ensemble description $\mathcal{E}$ in (4.2). Let $\Pi_j$ be the elements of this projective measurement
so that $\sum_j \Pi_j = I$, and let $J$ be the random variable that denotes the index $j$ of the
measurement outcome. Let us suppose at first, without loss of generality, that the state in
the ensemble is $|\psi_x\rangle$ for some $x \in \mathcal{X}$. Then the Born rule of the noiseless quantum theory
states that the conditional probability $p_{J|X}(j|x)$ of obtaining measurement result $j$ (given
that the state is $|\psi_x\rangle$) is

    p_{J|X}(j|x) = \langle\psi_x|\Pi_j|\psi_x\rangle, (4.5)

and the post-measurement state is

    \frac{\Pi_j|\psi_x\rangle}{\sqrt{p_{J|X}(j|x)}}. (4.6)

But we would also like to know the actual probability $p_J(j)$ of obtaining measurement result
$j$ for the ensemble description $\mathcal{E}$. By the law of total probability, the unconditional probability
$p_J(j)$ is

    p_J(j) = \sum_{x \in \mathcal{X}} p_{J|X}(j|x) p_X(x) (4.7)
           = \sum_{x \in \mathcal{X}} \langle\psi_x|\Pi_j|\psi_x\rangle p_X(x). (4.8)

The trace $\mathrm{Tr}\{A\}$ of an operator $A$ is

    \mathrm{Tr}\{A\} \equiv \sum_i \langle i|A|i\rangle, (4.9)

where $\{|i\rangle\}$ is some complete, orthonormal basis. (Observe that the trace operation is linear.)
We can then show the following useful property with the above definition:

    \mathrm{Tr}\{\Pi_j |\psi_x\rangle\langle\psi_x|\} = \sum_i \langle i|\Pi_j|\psi_x\rangle \langle\psi_x|i\rangle (4.10)
      = \sum_i \langle\psi_x|i\rangle \langle i|\Pi_j|\psi_x\rangle (4.11)
      = \langle\psi_x| \left( \sum_i |i\rangle\langle i| \right) \Pi_j |\psi_x\rangle (4.12)
      = \langle\psi_x|\Pi_j|\psi_x\rangle. (4.13)

The last equality uses the completeness relation $\sum_i |i\rangle\langle i| = I$. Thus, we continue with the
development in (4.8) and show that

    p_J(j) = \sum_{x \in \mathcal{X}} \mathrm{Tr}\{\Pi_j |\psi_x\rangle\langle\psi_x|\} p_X(x) (4.14)
           = \mathrm{Tr}\left\{ \Pi_j \sum_{x \in \mathcal{X}} p_X(x) |\psi_x\rangle\langle\psi_x| \right\}. (4.15)

We can rewrite the last equation as follows:

    p_J(j) = \mathrm{Tr}\{\Pi_j \rho\}, (4.16)

where we define the density operator $\rho$ as

    \rho \equiv \sum_{x \in \mathcal{X}} p_X(x) |\psi_x\rangle\langle\psi_x|. (4.17)

The above operator is known as the density operator because it is the quantum analog of a
probability density function.
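
To make (4.16) and (4.17) concrete, the following sketch builds the density operator for the example ensemble in (4.3) and computes the outcome probabilities for a measurement in the computational basis. The function names are hypothetical, and the choice of the projectors Pi_j = |j><j| is an assumption made here for the illustration.

```python
def outer(u, v):
    """Matrix of |u><v| for amplitude lists u and v."""
    return [[complex(a) * complex(b).conjugate() for b in v] for a in u]

def density_operator(ensemble):
    """rho = sum_x p_X(x) |psi_x><psi_x| for a list of (probability, state) pairs."""
    d = len(ensemble[0][1])
    rho = [[0j] * d for _ in range(d)]
    for p, psi in ensemble:
        proj = outer(psi, psi)
        for r in range(d):
            for c in range(d):
                rho[r][c] += p * proj[r][c]
    return rho

def basis(d, k):
    """Computational basis state |k> of a d-dimensional system."""
    return [1.0 if i == k else 0.0 for i in range(d)]

# The ensemble in (4.3): |1> with probability 1/3 and |3> with probability 2/3.
ensemble = [(1 / 3, basis(4, 1)), (2 / 3, basis(4, 3))]
rho = density_operator(ensemble)

# For Pi_j = |j><j|, the probability Tr{Pi_j rho} is the diagonal entry rho[j][j].
probs = [rho[j][j].real for j in range(4)]
trace_rho = sum(probs)
```

The probabilities come out as 0, 1/3, 0, and 2/3 for the outcomes 0 through 3, which matches the interpretation of the ensemble.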

We sometimes refer to the density operator as the expected density operator because
there is a sense in which we are taking the expectation over all of the states in the ensemble
in order to obtain the density operator. We can equivalently write the density operator as
follows:

    \rho = E_X\{|\psi_X\rangle\langle\psi_X|\}, (4.18)

where the expectation is with respect to the random variable $X$. Note that we are careful
to use the notation $|\psi_X\rangle$ instead of the notation $|\psi_x\rangle$ for the state inside of the expectation
because the state $|\psi_X\rangle$ is a random quantum state, random with respect to a classical random
variable $X$.

Exercise 4.1.1 Suppose the ensemble has a degenerate probability distribution, say $p_X(0) = 1$
and $p_X(x) = 0$ for all $x \neq 0$. What is the density operator of this degenerate ensemble?

Properties of the Density Operator 

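What are the properties that a given density operator must satisfy? Let us consider taking
the trace of $\rho$:

    \mathrm{Tr}\{\rho\} = \mathrm{Tr}\left\{ \sum_{x \in \mathcal{X}} p_X(x) |\psi_x\rangle\langle\psi_x| \right\} (4.19)
      = \sum_{x \in \mathcal{X}} p_X(x) \mathrm{Tr}\{|\psi_x\rangle\langle\psi_x|\} (4.20)
      = \sum_{x \in \mathcal{X}} p_X(x) \langle\psi_x|\psi_x\rangle (4.21)
      = \sum_{x \in \mathcal{X}} p_X(x) (4.22)
      = 1. (4.23)

The above development shows that every density operator should have unit trace because it
arises from an ensemble of quantum states. Every density operator is also positive, meaning
that

    \forall |\varphi\rangle : \langle\varphi|\rho|\varphi\rangle \geq 0. (4.24)

We write $\rho \geq 0$ to indicate that an operator is positive. The proof of positivity of any density
operator $\rho$ is as follows:

    \langle\varphi|\rho|\varphi\rangle = \langle\varphi| \left( \sum_{x \in \mathcal{X}} p_X(x) |\psi_x\rangle\langle\psi_x| \right) |\varphi\rangle (4.25)
      = \sum_{x \in \mathcal{X}} p_X(x) \langle\varphi|\psi_x\rangle \langle\psi_x|\varphi\rangle (4.26)
      = \sum_{x \in \mathcal{X}} p_X(x) |\langle\varphi|\psi_x\rangle|^2 \geq 0. (4.27)

The inequality follows because each $p_X(x)$ is a probability and is therefore non-negative.
Let us consider taking the conjugate transpose of the density operator $\rho$:

    \rho^\dagger = \left( \sum_{x \in \mathcal{X}} p_X(x) |\psi_x\rangle\langle\psi_x| \right)^\dagger (4.28)
      = \sum_{x \in \mathcal{X}} p_X(x) (|\psi_x\rangle\langle\psi_x|)^\dagger (4.29)
      = \sum_{x \in \mathcal{X}} p_X(x) |\psi_x\rangle\langle\psi_x| (4.30)
      = \rho. (4.31)

Every density operator is thus a Hermitian operator as well because the conjugate transpose
of $\rho$ is $\rho$.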

Ensembles and the Density Operator 

Every ensemble has a unique density operator, but the converse does not necessarily hold:
a density operator does not correspond to a unique ensemble and could correspond to
many ensembles.

Exercise 4.1.2 Show that the ensembles

    \{\{1/2, |0\rangle\}, \{1/2, |1\rangle\}\} (4.32)

and

    \{\{1/2, |+\rangle\}, \{1/2, |-\rangle\}\} (4.33)

have the same density operator.
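
A quick numerical check of the statement in Exercise 4.1.2 (not a substitute for the proof) is sketched below. The helper `density_operator` is a hypothetical name, and the states |+> and |-> are taken to be (|0> + |1>)/sqrt(2) and (|0> - |1>)/sqrt(2).

```python
import math

def density_operator(ensemble):
    """rho = sum_x p(x) |psi_x><psi_x| for (probability, real-amplitude state) pairs."""
    d = len(ensemble[0][1])
    rho = [[0.0] * d for _ in range(d)]
    for p, psi in ensemble:
        for r in range(d):
            for c in range(d):
                rho[r][c] += p * psi[r] * psi[c]   # amplitudes here are real
    return rho

s = 1 / math.sqrt(2)
ket0, ket1 = [1.0, 0.0], [0.0, 1.0]
plus, minus = [s, s], [s, -s]

rho_computational = density_operator([(0.5, ket0), (0.5, ket1)])
rho_plus_minus = density_operator([(0.5, plus), (0.5, minus)])

same = all(abs(rho_computational[r][c] - rho_plus_minus[r][c]) < 1e-12
           for r in range(2) for c in range(2))
```

Both ensembles give the matrix with 1/2 on the diagonal and 0 elsewhere, so no measurement can tell them apart.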

This last result has profound implications for the predictions of the quantum theory
because it is possible for two or more completely different ensembles to have the same
probabilities for measurement results. It also has important implications for quantum Shannon
theory.

By the spectral theorem, it follows that every density operator $\rho$ has a spectral
decomposition in terms of its eigenstates $\{|\phi_x\rangle\}_{x \in \{0,\ldots,d-1\}}$ because every $\rho$ is Hermitian:

    \rho = \sum_{x=0}^{d-1} \lambda_x |\phi_x\rangle\langle\phi_x|, (4.34)

where the coefficients $\lambda_x$ are the eigenvalues.

Exercise 4.1.3 Show that the coefficients $\lambda_x$ are probabilities using the facts that $\mathrm{Tr}\{\rho\} = 1$
and $\rho \geq 0$.

Thus, given any density operator $\rho$, we can define a "canonical" ensemble $\{\lambda_x, |\phi_x\rangle\}$
corresponding to it. This observation is so important for quantum Shannon theory that we
see this idea arise again and again throughout this book. The ensemble arising from the
spectral theorem is the most "efficient" ensemble, in a sense, and we will explore this idea
more in Chapter 17 on quantum compression (known as Schumacher compression after its
inventor).

Density Operator as the State 

We can also refer to the density operator as the state of a given quantum system because
it is possible to use it to calculate all of the predictions of the quantum theory. We can
make these calculations without having an ensemble description; all we need is the density
operator. The noisy quantum theory also subsumes the noiseless quantum theory because
any state $|\psi\rangle$ has a corresponding density operator $|\psi\rangle\langle\psi|$ in the noisy quantum theory, and
all calculations with this density operator in the noisy quantum theory give the same results
as using the state $|\psi\rangle$ in the noiseless quantum theory. For these reasons, we will say that
the state of a given quantum system is a density operator.

One of the most important states in the noisy quantum theory is the maximally mixed
state $\pi$. The maximally mixed state $\pi$ arises as the density operator of a uniform ensemble of
orthogonal states $\{1/d, |x\rangle\}$, where $d$ is the dimensionality of the Hilbert space. The maximally
mixed state $\pi$ is then equal to

    \pi \equiv \frac{1}{d} \sum_{x \in \mathcal{X}} |x\rangle\langle x| = \frac{I}{d}. (4.35)

Exercise 4.1.4 Show that $\pi$ is the density operator of the ensemble that chooses $|0\rangle$, $|1\rangle$,
$|+\rangle$, $|-\rangle$ with equal probability.

The purity $P(\rho)$ of a density operator $\rho$ is equal to

    P(\rho) \equiv \mathrm{Tr}\{\rho^\dagger \rho\} = \mathrm{Tr}\{\rho^2\}. (4.36)

The purity is one particular measure of the noisiness of a quantum state. The purity of a
pure state is equal to one, and the purity of a mixed state is strictly less than one.
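
The two extreme cases of the purity are easy to compute directly. The sketch below uses a hypothetical helper `purity` that evaluates Tr{rho^2} entry by entry; the example states (the pure state |+><+| and the maximally mixed state of dimension d) are chosen here for illustration.

```python
def purity(rho):
    """Tr{rho^2} computed as sum over r, c of rho[r][c] * rho[c][r]."""
    d = len(rho)
    return sum(rho[r][c] * rho[c][r] for r in range(d) for c in range(d)).real

def maximally_mixed(d):
    """The state I/d on a d-dimensional system."""
    return [[1 / d if r == c else 0.0 for c in range(d)] for r in range(d)]

# Density matrix of the pure state |+> = (|0> + |1>)/sqrt(2).
pure_plus = [[0.5, 0.5], [0.5, 0.5]]

purity_pure = purity(pure_plus)               # equals 1 for a pure state
purity_mixed_2 = purity(maximally_mixed(2))   # equals 1/2
purity_mixed_4 = purity(maximally_mixed(4))   # equals 1/4
```

In general the maximally mixed state has purity 1/d, the smallest value possible in dimension d.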
The Density Operator on the Bloch Sphere

Consider a pure qubit state

    |\psi\rangle \equiv \cos(\theta/2) |0\rangle + e^{i\varphi} \sin(\theta/2) |1\rangle. (4.37)

It has the following density operator representation:

    |\psi\rangle\langle\psi| = \left( \cos(\theta/2)|0\rangle + e^{i\varphi}\sin(\theta/2)|1\rangle \right) \left( \cos(\theta/2)\langle 0| + e^{-i\varphi}\sin(\theta/2)\langle 1| \right) (4.38)
      = \cos^2(\theta/2) |0\rangle\langle 0| + e^{-i\varphi} \sin(\theta/2)\cos(\theta/2) |0\rangle\langle 1|
        + e^{i\varphi} \sin(\theta/2)\cos(\theta/2) |1\rangle\langle 0| + \sin^2(\theta/2) |1\rangle\langle 1|. (4.39)

The matrix representation, or density matrix, of this density operator with respect to the
computational basis is as follows:

    \begin{bmatrix} \cos^2(\theta/2) & e^{-i\varphi}\sin(\theta/2)\cos(\theta/2) \\ e^{i\varphi}\sin(\theta/2)\cos(\theta/2) & \sin^2(\theta/2) \end{bmatrix}. (4.40)

Using trigonometric identities, it follows that the density matrix is equal to the following
matrix:

    \frac{1}{2} \begin{bmatrix} 1 + \cos(\theta) & \sin(\theta)(\cos(\varphi) - i\sin(\varphi)) \\ \sin(\theta)(\cos(\varphi) + i\sin(\varphi)) & 1 - \cos(\theta) \end{bmatrix}. (4.41)

We can further exploit the Pauli matrices, defined in Section 3.3.3, to represent the density
matrix as follows:

    \frac{1}{2} (I + r_x X + r_y Y + r_z Z), (4.42)

where

    r_x = \sin(\theta)\cos(\varphi), (4.43)
    r_y = \sin(\theta)\sin(\varphi), (4.44)
    r_z = \cos(\theta). (4.45)

The coefficients $r_x$, $r_y$, and $r_z$ are none other than the Cartesian representation of the angles
$\theta$ and $\varphi$, and they thus correspond to a unit vector.

More generally, the formula in (4.42) can represent an arbitrary density operator where
the coefficients $r_x$, $r_y$, and $r_z$ do not necessarily correspond to a unit vector, but rather a
vector $r$ such that $\|r\|_2 \leq 1$. Consider that the density matrix in (4.42) is as follows:

    \frac{1}{2} \begin{bmatrix} 1 + r_z & r_x - i r_y \\ r_x + i r_y & 1 - r_z \end{bmatrix}. (4.46)

The above matrix corresponds to a valid density matrix because it has unit trace, it is
Hermitian, and it is positive (the next exercise asks you to verify these facts). This alternate
representation of the density matrix as a vector in the Bloch sphere is useful for visualizing
noisy processes in the noisy quantum theory.
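
The correspondence between a Bloch vector and its density matrix can be sketched in a few lines. The code below assumes the matrix form in (4.46) and the standard Pauli matrices; the function names and the sample vector r = (0.3, -0.4, 0.5) are hypothetical choices made here.

```python
import math

def bloch_to_density(rx, ry, rz):
    """Density matrix (1/2)(I + rx X + ry Y + rz Z), written out as in (4.46)."""
    return [[(1 + rz) / 2, (rx - 1j * ry) / 2],
            [(rx + 1j * ry) / 2, (1 - rz) / 2]]

def density_to_bloch(rho):
    """Recover (rx, ry, rz) as (Tr{rho X}, Tr{rho Y}, Tr{rho Z})."""
    rx = (rho[0][1] + rho[1][0]).real          # Tr{rho X}
    ry = (1j * (rho[0][1] - rho[1][0])).real   # Tr{rho Y}
    rz = (rho[0][0] - rho[1][1]).real          # Tr{rho Z}
    return rx, ry, rz

def eigenvalues(rx, ry, rz):
    """Eigenvalues (1 +/- ||r||_2)/2 of the density matrix with Bloch vector r."""
    norm = math.sqrt(rx * rx + ry * ry + rz * rz)
    return (1 + norm) / 2, (1 - norm) / 2

r = (0.3, -0.4, 0.5)                    # a vector with ||r||_2 < 1 (a mixed state)
rho = bloch_to_density(*r)
recovered = density_to_bloch(rho)
lam_plus, lam_minus = eigenvalues(*r)
determinant = (rho[0][0] * rho[1][1] - rho[0][1] * rho[1][0]).real
```

The product of the two eigenvalues agrees with the determinant of the matrix, and both eigenvalues are non-negative exactly when the norm of r is at most one.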



Exercise 4.1.5 Show that the matrix in (4.46) has unit trace, is Hermitian, and is positive
for all $r$ such that $\|r\|_2 \leq 1$. It thus corresponds to a valid density matrix.

Exercise 4.1.6 Show that we can compute the Bloch sphere coordinates $r_x$, $r_y$, and $r_z$ with
the respective formulas $\mathrm{Tr}\{\rho X\}$, $\mathrm{Tr}\{\rho Y\}$, and $\mathrm{Tr}\{\rho Z\}$ using the representation in (4.46) and
the result of Exercise 3.3.5.

Exercise 4.1.7 Show that the eigenvalues of a general qubit density operator with density
matrix representation in (4.46) are as follows:

    \frac{1}{2} (1 \pm \|r\|_2). (4.47)

Exercise 4.1.8 Show that a mixture of pure states $|\psi_j\rangle$ each with Bloch vector $r_j$ and
probability $p(j)$ gives a density matrix with the Bloch vector $r$ where

    r = \sum_j p(j) r_j. (4.48)


4.1.2 An Ensemble of Ensembles 

The most general ensemble that we can construct is an ensemble of ensembles, i.e., an
ensemble $\mathcal{F}$ of density operators where

    \mathcal{F} \equiv \{p_X(x), \rho_x\}. (4.49)

The ensemble $\mathcal{F}$ essentially has two layers of randomization. The first layer is from the
distribution $p_X(x)$. Each density operator $\rho_x$ in $\mathcal{F}$ arises from an ensemble $\{p_{Y|X}(y|x), |\psi_{x,y}\rangle\}$.
The conditional distribution $p_{Y|X}(y|x)$ represents the second layer of randomization. Each $\rho_x$
is a density operator with respect to the above ensemble:

    \rho_x \equiv \sum_y p_{Y|X}(y|x) |\psi_{x,y}\rangle\langle\psi_{x,y}|. (4.50)

The ensemble $\mathcal{F}$ has its own density operator $\rho$ where

    \rho \equiv \sum_{x,y} p_{Y|X}(y|x) p_X(x) |\psi_{x,y}\rangle\langle\psi_{x,y}| (4.51)
         = \sum_x p_X(x) \rho_x. (4.52)

The density operator $\rho$ is the density operator from the perspective of someone who is
ignorant of $x$. Figure 4.1 displays the process by which we can select the ensemble $\mathcal{F}$.

Figure 4.1: The mixing process by which we can generate an "ensemble of ensembles." First choose a
realization $x$ according to the distribution $p_X(x)$. Then choose a realization $y$ according to the conditional
distribution $p_{Y|X}(y|x)$. Finally, choose a state $|\psi_{x,y}\rangle$ according to the realizations $x$ and $y$. This leads to an
ensemble $\{p_X(x), \rho_x\}$ where $\rho_x \equiv \sum_y p_{Y|X}(y|x) |\psi_{x,y}\rangle\langle\psi_{x,y}|$.



4.1.3 Noiseless Evolution of an Ensemble 

Quantum states can evolve in a noiseless fashion either according to a unitary operator or
a measurement. In this section, we determine the noiseless evolution of an ensemble and its
corresponding density operator. (We consider noisy evolution in Section 4.4.)



Noiseless Unitary Evolution of a Noisy State 

We first consider noiseless evolution according to some unitary $U$. Suppose we have the
ensemble $\mathcal{E}$ in (4.2) with density operator $\rho$. Suppose without loss of generality that the
state is $|\psi_x\rangle$. Then the evolution postulate of the noiseless quantum theory gives that the
state after the unitary evolution is as follows:

    U |\psi_x\rangle. (4.53)

This result implies that the evolution leads to a new ensemble

    \mathcal{E}_U \equiv \{p_X(x), U|\psi_x\rangle\}_{x \in \mathcal{X}}. (4.54)

The density operator of the evolved ensemble is

    \sum_{x \in \mathcal{X}} p_X(x) U |\psi_x\rangle\langle\psi_x| U^\dagger = U \left( \sum_{x \in \mathcal{X}} p_X(x) |\psi_x\rangle\langle\psi_x| \right) U^\dagger (4.55)
      = U \rho U^\dagger. (4.56)

Thus, the above relation shows that we can keep track of the evolution of the density
operator $\rho$, rather than worrying about keeping track of the evolution of every state in
the ensemble $\mathcal{E}$. It suffices to keep track of only the density operator evolution because this
operator is sufficient to determine the predictions of the quantum theory.
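
The equality between evolving every state in the ensemble and evolving the density operator can be checked on a small example. In the sketch below, the Hadamard gate and the two-state ensemble are hypothetical choices made here, and the helper names are not from the text.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(len(v))) for i in range(len(A))]

def dagger(A):
    """Conjugate transpose of a matrix."""
    return [[complex(A[r][c]).conjugate() for r in range(len(A))]
            for c in range(len(A[0]))]

def density_operator(ensemble):
    """rho = sum_x p(x) |psi_x><psi_x| for (probability, state) pairs."""
    d = len(ensemble[0][1])
    rho = [[0j] * d for _ in range(d)]
    for p, psi in ensemble:
        for r in range(d):
            for c in range(d):
                rho[r][c] += p * complex(psi[r]) * complex(psi[c]).conjugate()
    return rho

s = 1 / math.sqrt(2)
U = [[s, s], [s, -s]]                            # Hadamard gate as the example unitary
ensemble = [(0.25, [1, 0]), (0.75, [s, s])]      # |0> w.p. 1/4, |+> w.p. 3/4

# Route 1: evolve each state, then form the density operator of the new ensemble.
evolved = [(p, matvec(U, psi)) for p, psi in ensemble]
rho_from_evolved = density_operator(evolved)

# Route 2: form rho first, then conjugate it by U.
rho = density_operator(ensemble)
rho_conjugated = matmul(matmul(U, rho), dagger(U))

agree = all(abs(rho_from_evolved[r][c] - rho_conjugated[r][c]) < 1e-12
            for r in range(2) for c in range(2))
```

Both routes give the same matrix, which is the content of (4.55) and (4.56).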

Noiseless Measurement of a Noisy State 

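In a similar fashion, we can analyze the result of a measurement on a system with ensemble
description $\mathcal{E}$ in (4.2). Suppose that we perform a projective measurement with projection
operators $\{\Pi_j\}_j$ where $\sum_j \Pi_j = I$. Suppose further without loss of generality that the state
in the ensemble is $|\psi_x\rangle$. Then the noiseless quantum theory predicts that the probability of
obtaining outcome $j$ conditioned on the index $x$ is

    p_{J|X}(j|x) = \langle\psi_x|\Pi_j|\psi_x\rangle, (4.57)

and the resulting state is

    \frac{\Pi_j |\psi_x\rangle}{\sqrt{p_{J|X}(j|x)}}. (4.58)

Supposing that we receive outcome $j$, then we have a new ensemble:

    \mathcal{E}_j \equiv \left\{ p_{X|J}(x|j), \Pi_j|\psi_x\rangle / \sqrt{p_{J|X}(j|x)} \right\}_{x \in \mathcal{X}}. (4.59)

The density operator for this ensemble is

    \sum_{x \in \mathcal{X}} \frac{p_{X|J}(x|j)}{p_{J|X}(j|x)} \Pi_j |\psi_x\rangle\langle\psi_x| \Pi_j
      = \Pi_j \left( \sum_{x \in \mathcal{X}} \frac{p_{X|J}(x|j)}{p_{J|X}(j|x)} |\psi_x\rangle\langle\psi_x| \right) \Pi_j (4.60)
      = \Pi_j \left( \sum_{x \in \mathcal{X}} \frac{p_{J|X}(j|x) p_X(x)}{p_{J|X}(j|x) p_J(j)} |\psi_x\rangle\langle\psi_x| \right) \Pi_j (4.61)
      = \frac{\Pi_j \left( \sum_{x \in \mathcal{X}} p_X(x) |\psi_x\rangle\langle\psi_x| \right) \Pi_j}{p_J(j)} (4.62)
      = \frac{\Pi_j \rho \Pi_j}{p_J(j)}. (4.63)

The second equality follows from applying the Bayes rule:

    p_{X|J}(x|j) = p_{J|X}(j|x) p_X(x) / p_J(j). (4.64)

The above expression gives the evolution of the density operator under a measurement. We
can again employ the law of total probability to compute that $p_J(j)$ is

    p_J(j) = \sum_{x \in \mathcal{X}} p_{J|X}(j|x) p_X(x) (4.65)
      = \sum_{x \in \mathcal{X}} p_X(x) \langle\psi_x|\Pi_j|\psi_x\rangle (4.66)
      = \sum_{x \in \mathcal{X}} p_X(x) \mathrm{Tr}\{|\psi_x\rangle\langle\psi_x| \Pi_j\} (4.67)
      = \mathrm{Tr}\left\{ \sum_{x \in \mathcal{X}} p_X(x) |\psi_x\rangle\langle\psi_x| \Pi_j \right\} (4.68)
      = \mathrm{Tr}\{\rho \Pi_j\}. (4.69)

We can think of $\mathrm{Tr}\{\rho \Pi_j\}$ as the area of the shadow of $\rho$ onto the space onto which the
projector $\Pi_j$ projects.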
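
The measurement update can be checked along both routes: updating every state in the ensemble with Bayes' rule, and applying the density operator formula directly. The sketch below is an illustration under assumptions made here: a qutrit ensemble of two states and a rank-two projector onto the span of |0> and |1>. All helper names are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(len(v))) for i in range(len(A))]

def density_operator(ensemble):
    d = len(ensemble[0][1])
    rho = [[0j] * d for _ in range(d)]
    for p, psi in ensemble:
        for r in range(d):
            for c in range(d):
                rho[r][c] += p * complex(psi[r]) * complex(psi[c]).conjugate()
    return rho

def measure_two_ways(ensemble, proj):
    """Return p_J(j) and the post-measurement density operator computed two ways."""
    rho = density_operator(ensemble)
    d = len(rho)
    proj_rho = matmul(proj, rho)
    p_j = sum(proj_rho[i][i] for i in range(d)).real        # Tr{Pi_j rho}
    # Route 1: project and renormalize each state, reweight with Bayes' rule.
    updated = []
    for p_x, psi in ensemble:
        phi = matvec(proj, psi)
        p_j_given_x = sum(abs(a) ** 2 for a in phi)         # <psi_x|Pi_j|psi_x>
        if p_j_given_x > 1e-15:                             # skip impossible outcomes
            p_x_given_j = p_j_given_x * p_x / p_j
            updated.append((p_x_given_j, [a / math.sqrt(p_j_given_x) for a in phi]))
    via_ensemble = density_operator(updated)
    # Route 2: Pi_j rho Pi_j / p_J(j).
    via_rho = [[e / p_j for e in row] for row in matmul(proj_rho, proj)]
    return p_j, via_ensemble, via_rho

s2, s3 = 1 / math.sqrt(2), 1 / math.sqrt(3)
ensemble = [(0.5, [s3, s3, s3]), (0.5, [s2, -s2, 0.0])]
proj = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]                    # projector onto span{|0>, |1>}

p_j, via_ensemble, via_rho = measure_two_ways(ensemble, proj)
routes_agree = all(abs(via_ensemble[r][c] - via_rho[r][c]) < 1e-12
                   for r in range(3) for c in range(3))
```

For this example the outcome probability is 1/2 * 2/3 + 1/2 * 1 = 5/6, and the two routes produce the same post-measurement density operator.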

4.1.4 Probability Theory as a Special Case of the Noisy Quantum 
Theory 

It may help to build some intuition for the noisy quantum theory by showing how it contains 
probability theory as a special case. Indeed, we should expect this containment of probability 
theory within the noisy quantum theory to hold if the noisy quantum theory is making 
probabilistic predictions about the physical world. 

Let us again begin with an ensemble of quantum states, but this time, let us pick the
states in the ensemble to be special states, where they are all orthogonal to one another. If
the states in the ensemble are all orthogonal to one another, then they are essentially classical
states because there is a measurement that distinguishes them from one another. So, let us
pick the ensemble to be $\{p_X(x), |x\rangle\}_{x \in \mathcal{X}}$ where the states $\{|x\rangle\}_{x \in \mathcal{X}}$ form an orthonormal basis
for a Hilbert space of dimension $|\mathcal{X}|$. These states are classical because a measurement with
the following projection operators can distinguish them:

    \{|x\rangle\langle x|\}_{x \in \mathcal{X}}. (4.70)

The formal analogy of a probability distribution in the quantum world is the density
operator:

    p_X(x) \leftrightarrow \rho. (4.71)

The reason for this analogy is that we can use the density operator to calculate expectations
and moments of observables.

The formal analogy of a random variable is an observable. Let us consider the following
observable:

    X \equiv \sum_{x \in \mathcal{X}} x |x\rangle\langle x|, (4.72)

analogous to the observable in (3.227). We perform the following calculation to determine
the expectation of the observable $X$:

    E[X] = \mathrm{Tr}\{X \rho\}. (4.73)

Explicitly calculating this quantity, we discover that it is the same computation as that for
the expectation of random variable $X$:

    \mathrm{Tr}\{X \rho\} = \mathrm{Tr}\left\{ \sum_{x \in \mathcal{X}} x |x\rangle\langle x| \sum_{x' \in \mathcal{X}} p_X(x') |x'\rangle\langle x'| \right\} (4.74)
      = \sum_{x,x' \in \mathcal{X}} x \, p_X(x') |\langle x|x'\rangle|^2 (4.75)
      = \sum_{x \in \mathcal{X}} x \, p_X(x). (4.76)

Another useful notion in probability theory is the notion of an indicator random variable $I_A(X)$.
We define the indicator function $I_A(x)$ as follows:

    I_A(x) \equiv \begin{cases} 1 & : x \in A \\ 0 & : x \notin A \end{cases}. (4.77)

The expectation $E[I_A(X)]$ of the indicator random variable $I_A(X)$ is

    E[I_A(X)] = \sum_{x \in A} p_X(x) (4.78)
              = p_X(A), (4.79)

where $p_X(A)$ represents the probability of the set $A$. In the quantum theory, we can formulate
an indicator observable $I_A(X)$:

    I_A(X) \equiv \sum_{x \in A} |x\rangle\langle x|. (4.80)

It has eigenvalues equal to unity for all eigenvectors with labels $x$ in the set $A$, and it has
null eigenvalues for those eigenvectors with labels outside of $A$. It is straightforward to show
that the expectation $\mathrm{Tr}\{\rho I_A(X)\}$ of the indicator observable $I_A(X)$ is $p_X(A)$.

You may have noticed that the indicator observable is also a projection operator. So,
according to the postulates of the quantum theory, we can perform a measurement with
elements:

    \{I_A(X), I_{A^c}(X) \equiv I - I_A(X)\}. (4.81)

The result of such a projective measurement is to project onto the subspace given by $I_A(X)$
with probability $p_X(A)$ and to project onto the complementary subspace given by $I_{A^c}(X)$
with probability $1 - p_X(A)$.

We highlight the connection between the noisy quantum theory and probability theory
with two more examples. First, suppose that we have two disjoint sets $A$ and $B$. Then the
probability of their union is the sum of the probabilities of the individual sets:

    p(A \cup B) = p(A) + p(B), (4.82)

and the probability of the complementary set $(A \cup B)^c = A^c \cap B^c$ is equal to $1 - p(A) - p(B)$.
We can perform the analogous calculation in the noisy quantum theory. Let us consider two
projection operators

    \Pi_A \equiv \sum_{x \in A} |x\rangle\langle x|, (4.83)

    \Pi_B \equiv \sum_{x \in B} |x\rangle\langle x|. (4.84)

The sum of these projection operators gives a projection onto the union set $A \cup B$:

    \Pi_{A \cup B} \equiv \sum_{x \in A \cup B} |x\rangle\langle x| = \Pi_A + \Pi_B. (4.85)

Exercise 4.1.9 Show that $\mathrm{Tr}\{\Pi_{A \cup B} \rho\} = p(A) + p(B)$ whenever the projectors $\Pi_A$ and $\Pi_B$
do not have overlapping support and the density operator $\rho$ is diagonal in the same basis as
$\Pi_A$ and $\Pi_B$.

We can also consider intersections of sets. Suppose that we have two sets $A$ and $B$.
The intersection of these two sets consists of all the elements that are common to both sets.
There is an associated probability $p(A \cap B)$ with the intersection. We can again formulate
this idea in the noisy quantum theory. Consider the projection operators in (4.83)-(4.84).
The multiplication of these two projectors gives a projector onto the intersection of the two
spaces:

    \Pi_{A \cap B} = \Pi_A \Pi_B. (4.86)

Exercise 4.1.10 Show that $\mathrm{Tr}\{\Pi_A \Pi_B \rho\} = p(A \cap B)$ whenever the density operator $\rho$ is
diagonal in the same basis as $\Pi_A$ and $\Pi_B$.

Such ideas and connections to the classical world are crucial for understanding quantum
Shannon theory. Many times, we will be thinking about unions of disjoint subspaces and it is
useful to make the analogy with a union of disjoint sets. Also, in Chapter 16 on the Covering
Lemma, we will use projection operators to remove some of the support of an operator, and
this operation is analogous to taking intersections of sets.

Despite the fact that there is a strong connection for classical states, some of this intuition
breaks down when we consider the non-orthogonality of quantum states. For example, consider
the case of the projectors $\Pi_0 \equiv |0\rangle\langle 0|$ and $\Pi_+ \equiv |+\rangle\langle +|$. The two subspaces onto which these
operators project do not intersect, yet we know that the projectors have some overlap because
their corresponding states are non-orthogonal. One analogy of the intersection operation is
to slice out the support of one operator from another. For example, we can form the operator

    \Pi_0 \Pi_+ \Pi_0, (4.87)

and placing $\Pi_0$ on the outside slices out the support of $\Pi_+$ that is not in $\Pi_0$. Similarly, we
can slice out the support of $\Pi_0$ not in $\Pi_+$ by forming the following operator:

    \Pi_+ \Pi_0 \Pi_+. (4.88)

If the two projectors were to commute, then this ordering would not matter, and the resulting
operator would be a projector onto the intersection of the two subspaces. But this is not the
case for our example here, and the resulting operators are quite different.
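
The difference between the two orderings can be seen by multiplying out the 2 x 2 matrices. The sketch below uses the computational-basis matrices of the projectors onto |0> and |+>; the variable names are hypothetical.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

P0 = [[1.0, 0.0], [0.0, 0.0]]        # |0><0|
Pp = [[0.5, 0.5], [0.5, 0.5]]        # |+><+| with |+> = (|0> + |1>)/sqrt(2)

A = matmul(matmul(P0, Pp), P0)       # Pi_0 Pi_+ Pi_0, as in (4.87)
B = matmul(matmul(Pp, P0), Pp)       # Pi_+ Pi_0 Pi_+, as in (4.88)

commute = matmul(P0, Pp) == matmul(Pp, P0)
A_is_projector = matmul(A, A) == A   # a projector would satisfy A*A = A
```

Here `A` equals one half of the projector onto |0> and `B` equals one half of the projector onto |+>, so the two operators differ and neither is itself a projector.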

Exercise 4.1.11 (Union Bound) Prove a union bound for commuting projectors IT and 
II2 where < TIi, TI2 < / and for an arbitrary density operator p (not necessarily diagonal 
in the same basis as IT and n 2 ): 

Tr{(7 - ITn^p} < Tr{(/ - IT)/)} + Tr{(7 - n 2 )p}. (4.89) 
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4.2 Measurement in the Noisy Quantum Theory 

We have described measurement in the quantum theory using a set of projectors that form a 
resolution of the identity. For example, the set {IT,,} . of projectors that satisfy the condition 
^„ Uj = I form a valid von Neumann quantum measurement. A projective measurement is 
not the most general measurement that we can perform on a quantum system (though it is 
certainly one valid type of quantum measurement). 

The most general quantum measurement consists of a set of measurement operators 
{Mj}. that satisfy the following completeness condition: 

J2 M j M j = I- ( 4 -90) 

3 

Suppose that we have a pure state $|\psi\rangle$. Given a set of measurement operators of the
above form, the probability for obtaining outcome $j$ is

\[ p(j) = \langle\psi| M_j^\dagger M_j |\psi\rangle, \tag{4.91} \]

and the post-measurement state when we receive outcome $j$ is

\[ \frac{M_j |\psi\rangle}{\sqrt{p(j)}}. \tag{4.92} \]

Suppose that we instead have an ensemble $\{p_X(x), |\psi_x\rangle\}$ with density operator $\rho$. We
can carry out the same analysis as in (4.63) to show that the post-measurement state when we
measure result $j$ is

\[ \frac{M_j \rho M_j^\dagger}{p(j)}, \tag{4.93} \]

where the probability $p(j)$ for obtaining outcome $j$ is

\[ p(j) = \mathrm{Tr}\{M_j^\dagger M_j \rho\}. \tag{4.94} \]
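
The rules (4.90), (4.93), and (4.94) can be checked numerically on a small example. The sketch below uses a hypothetical pair of qubit measurement operators parametrized by `gamma` (chosen only for illustration; it does not appear in the text) and the input state $|+\rangle\langle +|$. All helper names are assumptions of this sketch.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def dagger(a):
    """Conjugate transpose."""
    return [[a[j][i].conjugate() for j in range(len(a))]
            for i in range(len(a[0]))]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

gamma = 0.25  # hypothetical parameter, chosen for illustration
measurement_ops = [
    [[1.0, 0.0], [0.0, math.sqrt(1 - gamma)]],
    [[0.0, math.sqrt(gamma)], [0.0, 0.0]],
]

# Completeness condition (4.90): the sum of M_j^dagger M_j should be the identity.
completeness = [[sum(matmul(dagger(m), m)[i][j] for m in measurement_ops)
                 for j in range(2)] for i in range(2)]

rho = [[0.5, 0.5], [0.5, 0.5]]  # density operator of the pure state |+>

probs = []
post_states = []
for m in measurement_ops:
    unnormalized = matmul(matmul(m, rho), dagger(m))   # M_j rho M_j^dagger
    p_j = trace(unnormalized)   # equals Tr{M_j^dagger M_j rho} by cyclicity of the trace
    probs.append(p_j)
    post_states.append([[x / p_j for x in row] for row in unnormalized])   # (4.93)
```

With these choices the second outcome leaves the system in $|0\rangle\langle 0|$, and the two outcome probabilities sum to one.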

4.2.1 POVM Formalism 

Sometimes, we simply may not care about the post-measurement state of a quantum
measurement, but instead we only care about the probability for obtaining a particular outcome.
This situation arises in the transmission of classical data over a quantum channel. In this 
situation, we are merely concerned with minimizing the error probabilities of the classical 
transmission. The receiver does not care about the post-measurement state because he no 
longer needs it in the quantum information processing protocol. 

We can specify a measurement of this sort by some set $\{\Lambda_j\}_j$ of operators that satisfy
positivity and completeness:

\[ \Lambda_j \geq 0, \tag{4.95} \]

\[ \sum_j \Lambda_j = I. \tag{4.96} \]

The set $\{\Lambda_j\}_j$ of operators is a positive operator-valued measure (POVM). The probability
for obtaining outcome $j$ is

\[ \langle\psi| \Lambda_j |\psi\rangle, \tag{4.97} \]

if the state is some pure state $|\psi\rangle$. The probability for obtaining outcome $j$ is

\[ \mathrm{Tr}\{\Lambda_j \rho\}, \tag{4.98} \]

if the state is a mixed state described by some density operator $\rho$.
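
To make the conditions (4.95) and (4.96) and the probability rule (4.98) concrete, here is a minimal sketch with a three-element "trine" POVM on a qubit. The trine construction is an illustrative assumption and is not taken from the text; each element is a positive multiple of a rank-one projector, so positivity holds by construction, and the code checks completeness numerically.

```python
import math

def outer(v):
    """|v><v| for a real vector v."""
    return [[a * b for b in v] for a in v]

def trace_of_product(a, b):
    """Tr{A B} for square matrices of equal size."""
    n = len(a)
    return sum(a[i][j] * b[j][i] for i in range(n) for j in range(n))

# Three real qubit states spaced evenly in angle (an assumed example).
angles = [2 * math.pi * k / 3 for k in range(3)]
states = [[math.cos(t), math.sin(t)] for t in angles]

# POVM elements: (2/3) |e_k><e_k|.
povm = [[[(2 / 3) * x for x in row] for row in outer(v)] for v in states]

# Completeness (4.96): the elements should sum to the identity.
total = [[sum(el[i][j] for el in povm) for j in range(2)] for i in range(2)]

rho = [[1.0, 0.0], [0.0, 0.0]]   # the state |0><0|
probs = [trace_of_product(el, rho) for el in povm]   # Tr{Lambda_j rho}, as in (4.98)
```

For the input $|0\rangle\langle 0|$ the outcome probabilities are 2/3, 1/6, and 1/6, which sum to one as completeness requires.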

Exercise 4.2.1 Consider the following five "Chrysler" states:

\[ |e_k\rangle = \cos(2\pi k/5)|0\rangle + \sin(2\pi k/5)|1\rangle, \tag{4.99} \]

where $k \in \{0, \ldots, 4\}$. These states are the "Chrysler" states because they form a pentagon
on the XZ-plane of the Bloch sphere. Show that the following set of operators forms a valid
POVM:

\[ \left\{ \tfrac{2}{5} |e_k\rangle\langle e_k| \right\}. \tag{4.100} \]

Exercise 4.2.2 Suppose we have an ensemble $\{p(x), \rho_x\}$ of density operators and a POVM
with elements $\{\Lambda_x\}$ that should identify the states $\rho_x$ with high probability, i.e., we would
like $\mathrm{Tr}\{\rho_x \Lambda_x\}$ to be as high as possible. The expected success probability of the POVM is
then

\[ \sum_x p(x)\, \mathrm{Tr}\{\rho_x \Lambda_x\}. \tag{4.101} \]

Suppose that there exists some operator $\tau$ such that

\[ \tau \geq p(x) \rho_x, \tag{4.102} \]

where the condition $\tau \geq p(x)\rho_x$ is the same as $\tau - p(x)\rho_x \geq 0$ (the operator $\tau - p(x)\rho_x$ is a
positive operator). Show that $\mathrm{Tr}\{\tau\}$ is an upper bound on the expected success probability of
the POVM. After doing so, consider the case of encoding $n$ bits into a $d$-dimensional subspace.
By choosing states uniformly at random (in the case of the ensemble $\{2^{-n}, \rho_i\}_{i \in \{0,1\}^n}$), show
that the expected success probability is bounded above by $d\, 2^{-n}$. Thus, it is not possible
to store more than $n$ classical bits in $n$ qubits and have a perfect success probability of
retrieval (this is a simplified version of the Holevo bound, about which we will learn more in
Chapter 11).



4.3 Composite Noisy Quantum Systems 

We are again interested in the behavior of two or more quantum systems when we join 
them together. Some of the most exotic, truly "quantum" behavior occurs in joint quantum 
systems, and we observe a marked departure from the classical world.

4.3.1 Independent Ensembles

Let us first suppose that we have two independent ensembles for quantum systems A and 
B. The first quantum system belongs to Alice and the second quantum system belongs to 
Bob, and they may or may not be spatially separated. Let $\{p_X(x), |\psi_x\rangle\}$ be the ensemble
for the system $A$ and let $\{p_Y(y), |\phi_y\rangle\}$ be the ensemble for the system $B$. Suppose for now
that the state on system $A$ is $|\psi_x\rangle$ for some $x$ and the state on system $B$ is $|\phi_y\rangle$ for some
$y$. Then, using the composite system postulate of the noiseless quantum theory, the joint
state for a given $x$ and $y$ is $|\psi_x\rangle \otimes |\phi_y\rangle$. The density operator for the joint quantum system
is the expectation of the states $|\psi_x\rangle \otimes |\phi_y\rangle$ with respect to the random variables $X$ and $Y$
that describe the individual ensembles:

\[ \mathbb{E}_{X,Y}\{(|\psi_X\rangle \otimes |\phi_Y\rangle)(\langle\psi_X| \otimes \langle\phi_Y|)\}. \tag{4.103} \]

The above expression is equal to the following one:

\[ \mathbb{E}_{X,Y}\{|\psi_X\rangle\langle\psi_X| \otimes |\phi_Y\rangle\langle\phi_Y|\}, \tag{4.104} \]

because $(|\psi_x\rangle \otimes |\phi_y\rangle)(\langle\psi_x| \otimes \langle\phi_y|) = |\psi_x\rangle\langle\psi_x| \otimes |\phi_y\rangle\langle\phi_y|$. We then explicitly write out the
expectation as a sum over probabilities:

\[ \sum_{x,y} p_X(x)\, p_Y(y)\, |\psi_x\rangle\langle\psi_x| \otimes |\phi_y\rangle\langle\phi_y|. \tag{4.105} \]

We can distribute the probabilities and the sum because the tensor product obeys a
distributive property:

\[ \sum_x p_X(x) |\psi_x\rangle\langle\psi_x| \otimes \sum_y p_Y(y) |\phi_y\rangle\langle\phi_y|. \tag{4.106} \]

The density operator for this ensemble admits the following simple form:

\[ \rho \otimes \sigma, \tag{4.107} \]

where $\rho$ is the density operator of the $X$ ensemble and $\sigma$ is the density operator of the $Y$
ensemble. We can say that Alice's local density operator is $\rho$ and Bob's local density operator
is $\sigma$. We call a density operator of the above form a product state. We should expect the
density operator to factorize as it does above because we assumed that the ensembles are 
independent. There is nothing much that distinguishes this situation from the classical 
world, except for the fact that the states in each respective ensemble may be non-orthogonal 
to other states in the same ensemble. But even here, there is some equivalent description 
of each ensemble in terms of an orthonormal basis so that there is really no difference 
between this description and a joint probability distribution that factors as two independent 
distributions. 

Exercise 4.3.1 Show that the purity $P(\rho^A)$ is equal to the following expression:

\[ P(\rho^A) = \mathrm{Tr}\{(\rho^A \otimes \rho^{A'}) F^{AA'}\}, \tag{4.108} \]

where system $A'$ has a Hilbert space structure isomorphic to that of system $A$ and $F^{AA'}$ is
the swap operator that has the following action on kets in $A$ and $A'$:

\[ \forall x, y \quad F^{AA'} |x\rangle^A |y\rangle^{A'} = |y\rangle^A |x\rangle^{A'}. \tag{4.109} \]

(Hint: First show that $\mathrm{Tr}\{f(\rho^A)\} = \mathrm{Tr}\{(f(\rho^A) \otimes I^{A'}) F^{AA'}\}$ for any function $f$ on the
operators in system $A$.)

4.3.2 Separable States 

Let us now consider two systems A and B whose corresponding ensembles are correlated. 
We describe this correlated ensemble as the joint ensemble

\[ \{p_X(x), |\psi_x\rangle \otimes |\phi_x\rangle\}. \tag{4.110} \]

It is straightforward to verify that the density operator of this correlated ensemble has the
following form:

\[ \mathbb{E}_X\{(|\psi_X\rangle \otimes |\phi_X\rangle)(\langle\psi_X| \otimes \langle\phi_X|)\} = \sum_x p_X(x) |\psi_x\rangle\langle\psi_x| \otimes |\phi_x\rangle\langle\phi_x|. \tag{4.111} \]

The above state is a separable state. The term "separable" implies that there is no quantum
entanglement in the above state, i.e., there is a completely classical procedure that prepares
the above state. By ignoring Bob's system, Alice's local density operator is of the form

\[ \mathbb{E}_X\{|\psi_X\rangle\langle\psi_X|\} = \sum_x p_X(x) |\psi_x\rangle\langle\psi_x|, \tag{4.112} \]

and similarly, Bob's local density operator is

\[ \mathbb{E}_X\{|\phi_X\rangle\langle\phi_X|\} = \sum_x p_X(x) |\phi_x\rangle\langle\phi_x|. \tag{4.113} \]

We can generalize this classical preparation procedure one step further, using an idea
similar to the "ensemble of ensembles" idea in Section 4.1.2. Let us suppose that we first
generate a random variable $Z$ according to some distribution $p_Z(z)$. We then generate two
other ensembles, conditional on the value of the random variable $Z$. Let $\{p_{X|Z}(x|z), |\psi_{x,z}\rangle\}$
be the first ensemble and let $\{p_{Y|Z}(y|z), |\phi_{y,z}\rangle\}$ be the second ensemble, where the random
variables $X$ and $Y$ are independent when conditioned on $Z$. Let us label the density operators
of the first and second ensembles when conditioned on a particular realization $z$ by $\rho_z$ and $\sigma_z$,
respectively. It is then straightforward to verify that the density operator of an ensemble
created from this classical preparation procedure has the following form:

\[ \mathbb{E}_{X,Y,Z}\{(|\psi_{X,Z}\rangle \otimes |\phi_{Y,Z}\rangle)(\langle\psi_{X,Z}| \otimes \langle\phi_{Y,Z}|)\} = \sum_z p_Z(z)\, \rho_z \otimes \sigma_z. \tag{4.114} \]

Exercise 4.3.2 By ignoring Bob's system, we can determine Alice's local density operator.
Show that

\[ \mathbb{E}_{X,Y,Z}\{|\psi_{X,Z}\rangle\langle\psi_{X,Z}|\} = \sum_z p_Z(z) \rho_z, \tag{4.115} \]

so that the above expression is the density operator for Alice. It similarly follows that the
local density operator for Bob is

\[ \mathbb{E}_{X,Y,Z}\{|\phi_{Y,Z}\rangle\langle\phi_{Y,Z}|\} = \sum_z p_Z(z) \sigma_z. \tag{4.116} \]



The density operator in (4.114) is the most general form of a separable state because the
above procedure is the most general classical preparation procedure that produces classical
correlations (we could generalize further with more ensembles of ensembles, but they would
ultimately lead to this form because the set of separable states is a convex set). A state is
entangled if we cannot write it in the form in (4.114), as a convex combination of product
states.

Exercise 4.3.3 Show that we can always write a separable state as a convex combination
of pure product states:

\[ \sum_z p_Z(z) |\phi_z\rangle\langle\phi_z| \otimes |\psi_z\rangle\langle\psi_z|, \tag{4.117} \]

by manipulating the general form in (4.114).



4.3.3 Local Density Operator 

A First Example 

Consider the entangled Bell state $|\Phi^+\rangle^{AB}$ shared on systems $A$ and $B$. In the above analyses,
we were concerned with determining a local density operator description for both Alice
and Bob. Now, we are curious if it is possible to determine such a local density operator
description for Alice and Bob with respect to the state $|\Phi^+\rangle^{AB}$.

As a first approach to this issue, recall that the density operator description arises from
its usefulness in determining the probabilities of the outcomes of a particular measurement.
We say that the density operator is "the state" of the system merely because it is a
mathematical representation that allows us to compute the probabilities resulting from a physical
measurement. So, if we would like to determine a "local density operator," such a local
density operator should predict the result of a local measurement.

Let us consider a local measurement with measurement operators $\{M_m^A\}_m$ that Alice can
perform on her system. The global measurement operators for this local measurement are
$\{M_m^A \otimes I^B\}_m$ because nothing (the identity) happens to Bob's system. The probability of
obtaining result $m$ is

\begin{align}
\langle\Phi^+| M_m^A \otimes I^B |\Phi^+\rangle &= \frac{1}{2} \sum_{i,j=0}^{1} \langle ii| M_m^A \otimes I^B |jj\rangle \tag{4.118} \\
&= \frac{1}{2} \sum_{i,j=0}^{1} \langle i| M_m^A |j\rangle \langle i|j\rangle \tag{4.119} \\
&= \frac{1}{2} \left( \langle 0| M_m^A |0\rangle + \langle 1| M_m^A |1\rangle \right) \tag{4.120} \\
&= \frac{1}{2} \left( \mathrm{Tr}\{M_m^A |0\rangle\langle 0|^A\} + \mathrm{Tr}\{M_m^A |1\rangle\langle 1|^A\} \right) \tag{4.121} \\
&= \mathrm{Tr}\left\{ M_m^A \tfrac{1}{2} \left( |0\rangle\langle 0|^A + |1\rangle\langle 1|^A \right) \right\} \tag{4.122} \\
&= \mathrm{Tr}\{M_m^A \pi^A\}. \tag{4.123}
\end{align}

The above steps follow by applying the rules of taking the inner product with respect to
tensor product operators. The last line follows by recalling the definition of the maximally
mixed state $\pi$ in (4.35), where $\pi$ here is a qubit maximally mixed state.

The above calculation demonstrates that we can predict the result of any local "Alice"
measurement using the density operator $\pi$. Therefore, it is reasonable to say that Alice's
local density operator is $\pi$, and we even go as far as to say that her local state is $\pi$. A symmetric
calculation shows that Bob's local state is also $\pi$.
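
The chain of equalities ending in (4.123) can be spot-checked numerically. The sketch below picks one hypothetical operator `m_a` on Alice's qubit (an arbitrary positive operator bounded by the identity, chosen only for illustration) and compares the global prediction on the Bell state with the local prediction from $\pi$. Helper names are assumptions of this sketch.

```python
import math

def kron(a, b):
    """Tensor (Kronecker) product of two matrices."""
    return [[a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

def expectation(vec, op):
    """<v|O|v> for a real vector v."""
    n = len(vec)
    return sum(vec[i] * op[i][j] * vec[j] for i in range(n) for j in range(n))

def trace_of_product(a, b):
    """Tr{A B} for square matrices of equal size."""
    n = len(a)
    return sum(a[i][j] * b[j][i] for i in range(n) for j in range(n))

s = 1 / math.sqrt(2)
phi_plus = [s, 0.0, 0.0, s]            # (|00> + |11>)/sqrt(2), system A first
identity = [[1.0, 0.0], [0.0, 1.0]]
pi = [[0.5, 0.0], [0.0, 0.5]]          # qubit maximally mixed state

m_a = [[0.75, 0.25], [0.25, 0.25]]     # hypothetical operator on Alice's system

p_global = expectation(phi_plus, kron(m_a, identity))   # left side of (4.118)
p_local = trace_of_product(m_a, pi)                     # right side of (4.123)

# The product state pi (x) pi gives the same local prediction.
p_product = trace_of_product(kron(m_a, identity), kron(pi, pi))
```

All three numbers agree for this choice of `m_a`; the check illustrates the derivation but does not replace it.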

This result concerning their local density operators may seem strange at first. The
following global state gives equivalent predictions for local measurements:

\[ \pi^A \otimes \pi^B. \tag{4.124} \]

Can we then conclude that an equivalent representation of the global state is the above
state? Absolutely not. The global state $|\Phi^+\rangle^{AB}$ and the above state give drastically
different predictions for global measurements. Exercise 4.3.5 below asks you to determine the
probabilities for measuring the global operator $Z^A \otimes Z^B$ when the global state is $|\Phi^+\rangle^{AB}$ or
$\pi^A \otimes \pi^B$, and the result is that the predictions are dramatically different.

Exercise 4.3.4 Show that the projection operators corresponding to a measurement of the
observable $Z^A \otimes Z^B$ are as follows:

\begin{align}
\Pi_{\text{even}} &= \frac{1}{2} \left( I^A \otimes I^B + Z^A \otimes Z^B \right) = |00\rangle\langle 00|^{AB} + |11\rangle\langle 11|^{AB}, \tag{4.125} \\
\Pi_{\text{odd}} &= \frac{1}{2} \left( I^A \otimes I^B - Z^A \otimes Z^B \right) = |01\rangle\langle 01|^{AB} + |10\rangle\langle 10|^{AB}. \tag{4.126}
\end{align}

This measurement is a parity measurement, where measurement operator $\Pi_{\text{even}}$ coherently
measures even parity and measurement operator $\Pi_{\text{odd}}$ measures odd parity.

Exercise 4.3.5 Show that a parity measurement (defined in the previous exercise) of the
state $|\Phi^+\rangle^{AB}$ returns an even parity result with probability one, and a parity measurement
of the state $\pi^A \otimes \pi^B$ returns even or odd parity with equal probability. Thus, despite the
fact that these states have the same local description, their global behavior is very different.
Show that the same is true for the phase parity measurement, given by

\begin{align}
\Pi_{X,\text{even}} &= \frac{1}{2} \left( I^A \otimes I^B + X^A \otimes X^B \right), \tag{4.127} \\
\Pi_{X,\text{odd}} &= \frac{1}{2} \left( I^A \otimes I^B - X^A \otimes X^B \right). \tag{4.128}
\end{align}

Exercise 4.3.6 Show that the maximally correlated state $\overline{\Phi}^{AB}$, where

\[ \overline{\Phi}^{AB} = \frac{1}{2} \left( |00\rangle\langle 00|^{AB} + |11\rangle\langle 11|^{AB} \right), \tag{4.129} \]

gives results for local measurements that are the same as those for the maximally entangled
state $|\Phi^+\rangle^{AB}$. Show that the above parity measurements can distinguish these states.

Partial Trace 

In general, we would like to determine a local density operator that predicts the outcomes of
all local measurements without having to resort repeatedly to an analysis like that in
(4.118)-(4.123). The general method for determining a local density operator is to employ the partial
trace operation. For a simple state of the form

\[ |x\rangle\langle x| \otimes |y\rangle\langle y|, \tag{4.130} \]

the partial trace is the following operation:

\[ |x\rangle\langle x| \, \mathrm{Tr}\{|y\rangle\langle y|\} = |x\rangle\langle x|, \tag{4.131} \]

where we "trace out" the second system to determine the local density operator for the first.
We define it mathematically as acting on any tensor product of rank-one operators (not
necessarily corresponding to a state)

\[ |x_1\rangle\langle x_2| \otimes |y_1\rangle\langle y_2|, \tag{4.132} \]

as follows:

\begin{align}
\mathrm{Tr}_2\{|x_1\rangle\langle x_2| \otimes |y_1\rangle\langle y_2|\} &= |x_1\rangle\langle x_2| \, \mathrm{Tr}\{|y_1\rangle\langle y_2|\} \tag{4.133} \\
&= |x_1\rangle\langle x_2| \, \langle y_2|y_1\rangle. \tag{4.134}
\end{align}

The subscript "2" of the trace operation indicates that the partial trace acts on the second
system. It is a linear operation, much like the full trace is a linear operation.

Exercise 4.3.7 Show that the partial trace operation is equivalent to

\[ \mathrm{Tr}_B\{|x_1\rangle\langle x_2|^A \otimes |y_1\rangle\langle y_2|^B\} = \sum_i \langle i|^B \left( |x_1\rangle\langle x_2|^A \otimes |y_1\rangle\langle y_2|^B \right) |i\rangle^B, \tag{4.135} \]

for some orthonormal basis $\{|i\rangle^B\}$ on Bob's system.

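The most general density operator on two systems $A$ and $B$ is some operator $\rho^{AB}$ that is
positive with unit trace. We can obtain the local density operator $\rho^A$ from $\rho^{AB}$ by tracing
out the $B$ system:

\[ \rho^A = \mathrm{Tr}_B\{\rho^{AB}\}. \tag{4.136} \]

In more detail, let us expand an arbitrary density operator $\rho^{AB}$ with an orthonormal basis
$\{|i\rangle^A \otimes |j\rangle^B\}_{i,j}$ for the bipartite (two-party) state:

\[ \rho^{AB} = \sum_{i,j,k,l} \lambda_{i,j,k,l} (|i\rangle^A \otimes |j\rangle^B)(\langle k|^A \otimes \langle l|^B). \tag{4.137} \]

The coefficients $\lambda_{i,j,k,l}$ are the matrix elements of $\rho^{AB}$ with respect to the basis $\{|i\rangle^A \otimes |j\rangle^B\}_{i,j}$,
and they are subject to the constraint of positivity and unit trace for $\rho^{AB}$. We can rewrite
the above operator as

\[ \rho^{AB} = \sum_{i,j,k,l} \lambda_{i,j,k,l} |i\rangle\langle k|^A \otimes |j\rangle\langle l|^B. \tag{4.138} \]

We can now evaluate the partial trace:

\begin{align}
\rho^A &= \mathrm{Tr}_B\left\{ \sum_{i,j,k,l} \lambda_{i,j,k,l} |i\rangle\langle k|^A \otimes |j\rangle\langle l|^B \right\} \tag{4.139} \\
&= \sum_{i,j,k,l} \lambda_{i,j,k,l} \mathrm{Tr}_B\{|i\rangle\langle k|^A \otimes |j\rangle\langle l|^B\} \tag{4.140} \\
&= \sum_{i,j,k,l} \lambda_{i,j,k,l} |i\rangle\langle k|^A \, \mathrm{Tr}\{|j\rangle\langle l|^B\} \tag{4.141} \\
&= \sum_{i,j,k,l} \lambda_{i,j,k,l} |i\rangle\langle k|^A \, \langle l|j\rangle \tag{4.142} \\
&= \sum_{i,j,k} \lambda_{i,j,k,j} |i\rangle\langle k|^A \tag{4.143} \\
&= \sum_{i,k} \left( \sum_j \lambda_{i,j,k,j} \right) |i\rangle\langle k|^A. \tag{4.144}
\end{align}

The second equality exploits the linearity of the partial trace operation. The last equality
explicitly shows how the partial trace operation earns its name: it is equivalent to performing
a trace operation over the coefficients corresponding to Bob's system.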
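
The final expression (4.144) translates directly into index arithmetic on a matrix. The following sketch assumes the joint basis is ordered with system $A$ first, so that row index `i * dim_b + j` corresponds to $|i\rangle^A \otimes |j\rangle^B$; the function name `partial_trace_b` and this ordering are conventions of the sketch, not of the text.

```python
def kron(a, b):
    """Tensor (Kronecker) product of two matrices."""
    return [[a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

def partial_trace_b(rho_ab, dim_a, dim_b):
    """Trace out system B: entry (i, k) is the sum over j of lambda_{i,j,k,j}."""
    return [[sum(rho_ab[i * dim_b + j][k * dim_b + j] for j in range(dim_b))
             for k in range(dim_a)] for i in range(dim_a)]

# Density operator of the Bell state (|00> + |11>)/sqrt(2).
bell = [
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
]
rho_a_bell = partial_trace_b(bell, 2, 2)      # maximally mixed state

# A product state returns its first factor.
rho = [[0.75, 0.25], [0.25, 0.25]]
sigma = [[0.5, 0.0], [0.0, 0.5]]
rho_a_product = partial_trace_b(kron(rho, sigma), 2, 2)
```

Applied to the Bell state, the function returns the maximally mixed state found in the first example; applied to a product state, it returns the first factor.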

The next exercise asks you to verify that the operator $\rho^A$, as defined by the partial trace,
predicts the results of a local measurement accurately and confirms the role of $\rho^A$ as a local
density operator.

Exercise 4.3.8 Suppose Alice and Bob share a quantum system in a state described by
the density operator $\rho^{AB}$. Consider a local measurement, with measurement operators
$\{M_m^A\}_m$, that Alice may perform on her system. The global measurement operators are
thus $\{M_m^A \otimes I^B\}_m$. Show that the probabilities predicted by the global density operator are
the same as those predicted by the local density operator $\rho^A$, where $\rho^A = \mathrm{Tr}_B\{\rho^{AB}\}$:

\[ \mathrm{Tr}\{(M_m^A \otimes I^B)\rho^{AB}\} = \mathrm{Tr}\{M_m^A \rho^A\}. \tag{4.145} \]

Thus, the predictions of the global noisy quantum theory are consistent with the predictions
of the local noisy quantum theory.

Exercise 4.3.9 Verify that the partial trace of a product state gives one of the density
operators in the product state:

\[ \mathrm{Tr}_2\{\rho \otimes \sigma\} = \rho. \tag{4.146} \]

This result is consistent with the observation near (4.107).

Exercise 4.3.10 Verify that the partial trace of a separable state gives the result in (4.115):

\[ \mathrm{Tr}_2\left\{ \sum_z p_Z(z)\, \rho_z \otimes \sigma_z \right\} = \sum_z p_Z(z) \rho_z. \tag{4.147} \]

Exercise 4.3.11 Consider the following density operator that is formally analogous to a
joint probability distribution $p_{X,Y}(x,y)$:

\[ \rho = \sum_{x,y} p_{X,Y}(x,y) |x\rangle\langle x| \otimes |y\rangle\langle y|, \tag{4.148} \]

where the sets of states $\{|x\rangle\}_x$ and $\{|y\rangle\}_y$ each form an orthonormal basis. Show that tracing
out the second system is formally analogous to taking the marginal distribution
$p_X(x) = \sum_y p_{X,Y}(x,y)$ of the joint distribution $p_{X,Y}(x,y)$. That is, we are left with a density operator
of the form

\[ \sum_x p_X(x) |x\rangle\langle x|. \tag{4.149} \]

Keep in mind that the partial trace is a generalization of the marginalization because it
handles more exotic quantum states besides the above "classical" state.

Exercise 4.3.12 Show that the two partial traces in any order on a bipartite system are
equivalent to a full trace:

\[ \mathrm{Tr}\{\rho^{AB}\} = \mathrm{Tr}_A\{\mathrm{Tr}_B\{\rho^{AB}\}\} = \mathrm{Tr}_B\{\mathrm{Tr}_A\{\rho^{AB}\}\}. \tag{4.150} \]

Exercise 4.3.13 Verify that Alice's local density operator does not change if Bob performs a
unitary operator or a measurement where he does not inform her of the measurement result.

4.3.4 Classical-Quantum Ensemble

We end our overview of composite noisy quantum systems by discussing one last type of 
joint ensemble: the classical-quantum ensemble. This ensemble is a generalization of the
"ensemble of ensembles" from before.

Let us consider the following ensemble of density operators:

\[ \{p_X(x), \rho_x\}_{x \in \mathcal{X}}. \tag{4.151} \]

The intuition here is that Alice prepares a quantum system in the state $\rho_x$ with probability
$p_X(x)$. She then passes this ensemble to Bob, and it is Bob's task to learn about it. He can
learn about the ensemble if Alice prepares a large number of them in the same way.

There is generally a loss of the information in the random variable $X$ once Alice has
prepared this ensemble. Bob can only learn about the distribution of the random variable
$X$ if each density operator $\rho_x$ is a pure state $|x\rangle\langle x|$ where the states $\{|x\rangle\}_{x \in \mathcal{X}}$ form an
orthonormal basis. The resulting density operator would be

\[ \rho = \sum_{x \in \mathcal{X}} p_X(x) |x\rangle\langle x|. \tag{4.152} \]

Bob could then perform a measurement with measurement operators $\{|x\rangle\langle x|\}_{x \in \mathcal{X}}$, and learn
about the distribution $p_X(x)$ with a large number of measurements.

In the general case, the density operators $\{\rho_x\}_{x \in \mathcal{X}}$ do not correspond to pure states, much
less orthonormal ones, and there is no procedure for Bob to learn about random variable $X$.
The density operator of the ensemble is

\[ \rho = \sum_{x \in \mathcal{X}} p_X(x) \rho_x, \tag{4.153} \]

and the information about the distribution of random variable $X$ becomes "mixed in" with
the "mixedness" of the density operators $\rho_x$. There is then no measurement that Bob
can perform on $\rho$ that allows him to learn about the probability distribution of random
variable $X$.

One solution to this issue is for Alice to prepare the following classical-quantum ensemble:

\[ \{p_X(x), |x\rangle\langle x|^X \otimes \rho_x^A\}_{x \in \mathcal{X}}, \tag{4.154} \]

where we label the first system as $X$ and the second as $A$. She simply correlates a state $|x\rangle$
with each density operator $\rho_x$, where the states $\{|x\rangle\}_{x \in \mathcal{X}}$ form an orthonormal basis. We
call this ensemble the "classical-quantum" ensemble because the first system is classical and
the second system is quantum. The density operator of the classical-quantum ensemble is a
classical-quantum state $\rho^{XA}$ where

\[ \rho^{XA} = \sum_{x \in \mathcal{X}} p_X(x) |x\rangle\langle x|^X \otimes \rho_x^A. \tag{4.155} \]

This "enlarged" ensemble lets Bob learn about random variable $X$ while at the same time
he can learn about the ensemble that Alice prepares. Bob can learn about the distribution
of random variable $X$ by performing a local measurement of the system $X$. He also can
learn about the states $\rho_x$ by performing a measurement on $A$ and combining the result of
this measurement with the result of the first measurement. The next exercises ask you to
verify these statements.

Exercise 4.3.14 Show that a local measurement of system $X$ reproduces the probability
distribution $p_X(x)$. Use local measurement operators $\{|x\rangle\langle x|\}_{x \in \mathcal{X}}$ to show that

\[ p_X(x) = \mathrm{Tr}\{\rho^{XA} (|x\rangle\langle x|^X \otimes I^A)\}. \tag{4.156} \]

Exercise 4.3.15 Show that performing a measurement with measurement operators $\{M_m\}$
on system $A$ is the same as performing a measurement of the ensemble in (4.151). Show that

\[ \mathrm{Tr}\{\rho M_m\} = \mathrm{Tr}\{\rho^{XA} (I^X \otimes M_m^A)\}, \tag{4.157} \]

where $\rho$ is defined in (4.153).



4.4 Noisy Evolution 



The evolution of a quantum state is never perfect. In this section, we introduce noise as 
resulting from the loss of information about a quantum system. This loss of information is 
similar to the lack of information about the preparation of a quantum state, as we have seen 
in the previous section. 

4.4.1 Noisy Evolution from a Random Unitary 

We begin with an example, the quantum bit-flip channel. Suppose that we prepare a quantum
state $|\psi\rangle$. For simplicity, let us suppose that we are able to prepare this state perfectly.
Suppose that we send this qubit over a quantum bit-flip channel, i.e., the channel applies the
$X$ Pauli operator (bit-flip operator) with some probability $p$ and applies the identity operator
with probability $1 - p$. We can describe the resulting state as the following ensemble:

\[ \{\{p, X|\psi\rangle\}, \{1 - p, |\psi\rangle\}\}. \tag{4.158} \]

The density operator of this ensemble is as follows:

\[ p X |\psi\rangle\langle\psi| X^\dagger + (1 - p) |\psi\rangle\langle\psi|. \tag{4.159} \]

We now generalize the above example by beginning with an ensemble

\[ \{p_X(x), |\psi_x\rangle\}_{x \in \mathcal{X}} \tag{4.160} \]

with density operator $\rho = \sum_{x \in \mathcal{X}} p_X(x) |\psi_x\rangle\langle\psi_x|$ and applying the bit-flip channel to this
ensemble. Given that the input state is $|\psi_x\rangle$, the resulting ensemble is as in (4.158) with $|\psi\rangle$
replaced by $|\psi_x\rangle$. The overall ensemble is then as follows:

\[ \{\{p_X(x)\, p, X|\psi_x\rangle\}, \{p_X(x)(1 - p), |\psi_x\rangle\}\}_{x \in \mathcal{X}}. \tag{4.161} \]

We can calculate the density operator of the above ensemble:

\[ \sum_{x \in \mathcal{X}} p_X(x)\, p\, X |\psi_x\rangle\langle\psi_x| X^\dagger + p_X(x)(1 - p) |\psi_x\rangle\langle\psi_x|, \tag{4.162} \]

and simplify the above density operator by employing the definition of $\rho$:

\[ p X \rho X^\dagger + (1 - p)\rho. \tag{4.163} \]

The above density operator is more "mixed" than the original density operator and we will
make this statement more precise in Chapter 10, when we study entropy.
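
As a small numerical illustration of (4.163), the sketch below applies the bit-flip channel to the pure state $|0\rangle\langle 0|$ and compares the purity $\mathrm{Tr}\{\rho^2\}$ before and after. Using purity as a stand-in for "mixedness" is an assumption of this sketch (the text defers the precise statement to the entropy chapter), and the function names are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

PAULI_X = [[0.0, 1.0], [1.0, 0.0]]

def bit_flip(rho, p):
    """Return p X rho X^dagger + (1 - p) rho; X is Hermitian, so X^dagger = X."""
    flipped = matmul(PAULI_X, matmul(rho, PAULI_X))
    return [[p * flipped[i][j] + (1 - p) * rho[i][j] for j in range(2)]
            for i in range(2)]

def purity(rho):
    """Tr{rho^2}; equals 1 for a pure state and is smaller for a mixed state."""
    return trace(matmul(rho, rho))

rho_in = [[1.0, 0.0], [0.0, 0.0]]    # |0><0|
rho_out = bit_flip(rho_in, 0.25)

purity_in = purity(rho_in)
purity_out = purity(rho_out)
```

For flip probability 0.25 the output is diagonal with entries 0.75 and 0.25, and the purity drops from 1 to 0.625.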



Random Unitaries 

The generalization of the above discussion is to consider some ensemble of unitaries (a
random unitary) $\{p(k), U_k\}$ that we can apply to an ensemble of states $\{p_X(x), |\psi_x\rangle\}_{x \in \mathcal{X}}$. It
is straightforward to show that the resulting density operator is

\[ \sum_k p(k) U_k \rho U_k^\dagger, \tag{4.164} \]

where $\rho$ is the density operator of the ensemble of states.

4.4.2 Noisy Evolution as the Loss of a Measurement Outcome 

We can also think about noise as arising from the loss of a measurement outcome. Suppose
that we have an ensemble of states $\{p_X(x), |\psi_x\rangle\}_{x \in \mathcal{X}}$ and we perform a measurement with
a set $\{M_k\}$ of measurement operators where $\sum_k M_k^\dagger M_k = I$. First let us suppose that we
know that the state is $|\psi_x\rangle$. Then the probability of obtaining the measurement outcome $k$
is $p_{K|X}(k|x)$ where

\[ p_{K|X}(k|x) = \langle\psi_x| M_k^\dagger M_k |\psi_x\rangle, \tag{4.165} \]

and the post-measurement state is

\[ \frac{M_k |\psi_x\rangle}{\sqrt{p_{K|X}(k|x)}}. \tag{4.166} \]

Let us now suppose that we lose track of the measurement outcome, or equivalently, someone
else measures the system and does not inform us of the measurement outcome. The resulting
ensemble description is then

\[ \left\{ p_{X|K}(x|k)\, p_K(k), \; \frac{M_k |\psi_x\rangle}{\sqrt{p_{K|X}(k|x)}} \right\}_{x \in \mathcal{X}, k}. \tag{4.167} \]

Figure 4.2: We use the diagram on the left to depict a noisy quantum channel $\mathcal{N}$ that takes a quantum
system $A$ to a quantum system $B$. This quantum channel is equivalent to the diagram on the right, where
some third party performs a measurement on the input system and does not inform the receiver of the
measurement outcome.



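The density operator of the ensemble is then

\begin{align}
& \sum_{x,k} p_{X|K}(x|k)\, p_K(k)\, \frac{M_k |\psi_x\rangle\langle\psi_x| M_k^\dagger}{p_{K|X}(k|x)} \nonumber \\
&\quad = \sum_{x,k} p_{K|X}(k|x)\, p_X(x)\, \frac{M_k |\psi_x\rangle\langle\psi_x| M_k^\dagger}{p_{K|X}(k|x)} \tag{4.168} \\
&\quad = \sum_{x,k} p_X(x)\, M_k |\psi_x\rangle\langle\psi_x| M_k^\dagger \tag{4.169} \\
&\quad = \sum_k M_k \rho M_k^\dagger. \tag{4.170}
\end{align}

We can thus write this evolution as a noisy map $\mathcal{N}(\rho)$ where

\[ \mathcal{N}(\rho) = \sum_k M_k \rho M_k^\dagger. \tag{4.171} \]

We derived the map in (4.171) from the perspective of the loss of a measurement outcome,
but it in fact represents a general evolution of a density operator, and the operators $M_k$ are
known as the Kraus operators. We can represent all noisy evolutions in the form (4.171).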



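The evolution of the density operator $\rho$ is trace-preserving because the resulting
density operator has unit trace:

\begin{align}
\mathrm{Tr}\{\mathcal{N}(\rho)\} &= \mathrm{Tr}\left\{ \sum_k M_k \rho M_k^\dagger \right\} \tag{4.172} \\
&= \sum_k \mathrm{Tr}\{M_k \rho M_k^\dagger\} \tag{4.173} \\
&= \sum_k \mathrm{Tr}\{M_k^\dagger M_k \rho\} \tag{4.174} \\
&= \mathrm{Tr}\left\{ \sum_k M_k^\dagger M_k\, \rho \right\} \tag{4.175} \\
&= \mathrm{Tr}\{\rho\} \tag{4.176} \\
&= 1. \tag{4.177}
\end{align}
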
There is another important condition that the map $\mathcal{N}(\rho)$ should satisfy: complete positivity.
Positivity is a special case of complete positivity, and this condition is that the output $\mathcal{N}(\rho)$
is a positive operator whenever the input $\rho$ is a positive operator. Positivity ensures that
the noisy evolution produces a quantum state as an output whenever the input is a quantum
state. Complete positivity is that the output of the tensor product map $(I^k \otimes \mathcal{N})(\sigma)$ for any
finite $k$ is a positive operator whenever the input $\sigma$ is a positive operator (this input operator
now lives on a tensor-product Hilbert space). If the input dimension of the noisy map is $d$,
then it is sufficient to consider $k = d$. Complete positivity makes good physical sense because
we expect that the action of a noisy map on one system of a quantum state and the identity
on the other part of that quantum state should produce as output another quantum state
(which is a positive operator). The map $\mathcal{N}(\rho)$ is a completely positive trace-preserving map,
and any physical evolution is such a map.
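
A map in the Kraus form (4.171) is simple to apply numerically. The sketch below uses the bit-flip channel written with Kraus operators $\sqrt{1-p}\, I$ and $\sqrt{p}\, X$ (a standard way to express that channel, stated here as an assumption rather than taken from the text) and checks completeness and trace preservation on one complex-valued input. The function `apply_channel` is a hypothetical helper.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def dagger(a):
    """Conjugate transpose."""
    return [[a[j][i].conjugate() for j in range(len(a))]
            for i in range(len(a[0]))]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def apply_channel(kraus_ops, rho):
    """Return the sum over k of M_k rho M_k^dagger."""
    n = len(rho)
    out = [[0.0] * n for _ in range(n)]
    for m in kraus_ops:
        term = matmul(matmul(m, rho), dagger(m))
        out = [[out[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return out

p = 0.25
kraus_ops = [
    [[math.sqrt(1 - p), 0.0], [0.0, math.sqrt(1 - p)]],   # sqrt(1 - p) I
    [[0.0, math.sqrt(p)], [math.sqrt(p), 0.0]],           # sqrt(p) X
]

# Completeness: the sum of M_k^dagger M_k should be the identity.
completeness = [[sum(matmul(dagger(m), m)[i][j] for m in kraus_ops)
                 for j in range(2)] for i in range(2)]

rho = [[0.5, -0.5j], [0.5j, 0.5]]     # a pure state with complex coherences
rho_out = apply_channel(kraus_ops, rho)
```

The output keeps unit trace while its off-diagonal entries shrink from magnitude 0.5 to 0.25, in agreement with (4.163).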



Exercise 4.4.1 Show that the evolution in (4.171) is positive, i.e., the evolution takes a
positive density operator to a positive density operator.

Exercise 4.4.2 Show that the evolution in (4.171) is completely positive.

Exercise 4.4.3 Show that the evolution in (4.171) is linear:

\[ \mathcal{N}\left( \sum_x p_X(x) \rho_x \right) = \sum_x p_X(x) \mathcal{N}(\rho_x), \tag{4.178} \]

for any probabilities $p_X(x)$ and density operators $\rho_x$.

Unitary evolution is a special case of the evolution in (4.171). We can think of it merely
as some measurement where we always know the measurement outcome. That is, it is a
measurement with one operator $U$ in the set of measurement operators and it satisfies the
completeness condition because $U^\dagger U = I$.

A completely positive trace-preserving map is the mathematical model that we use for a
quantum channel in quantum Shannon theory because it represents the most general noisy
evolution of a quantum state. This evolution is a generalization of the conditional probability
distribution noise model of classical information theory. To see this, suppose that the input
density operator $\rho$ is of the following form:

\[ \rho = \sum_x p_X(x) |x\rangle\langle x|, \tag{4.179} \]

where $\{|x\rangle\}$ is some orthonormal basis. We consider a channel $\mathcal{N}$ with the Kraus operators

\[ \left\{ \sqrt{p_{Y|X}(y|x)}\, |y\rangle\langle x| \right\}_{x,y}, \tag{4.180} \]

where $|y\rangle$ and $|x\rangle$ are part of the same basis. Evolution according to this map is then as
follows:

\begin{align}
\mathcal{N}(\rho) &= \sum_{x,y} \sqrt{p_{Y|X}(y|x)}\, |y\rangle\langle x| \left( \sum_{x'} p_X(x') |x'\rangle\langle x'| \right) \sqrt{p_{Y|X}(y|x)}\, |x\rangle\langle y| \tag{4.181} \\
&= \sum_{x,y,x'} p_{Y|X}(y|x)\, p_X(x')\, |\langle x'|x\rangle|^2\, |y\rangle\langle y| \tag{4.182} \\
&= \sum_{x,y} p_{Y|X}(y|x)\, p_X(x)\, |y\rangle\langle y| \tag{4.183} \\
&= \sum_y \left( \sum_x p_{Y|X}(y|x)\, p_X(x) \right) |y\rangle\langle y|. \tag{4.184}
\end{align}

Thus, the evolution is the same as that which a noisy classical channel $p_{Y|X}(y|x)$ would enact on a
probability distribution $p_X(x)$ by taking it to

\[ p_Y(y) = \sum_x p_{Y|X}(y|x)\, p_X(x) \tag{4.185} \]

at the output.
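
The reduction to a classical channel can be checked on a small example. The sketch below builds the Kraus operators of (4.180) for a hypothetical binary channel (the transition probabilities in `cond` and the input distribution `p_x` are made up for illustration) and compares the diagonal of the output density operator with the classical output distribution (4.185).

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def dagger(a):
    """Conjugate transpose."""
    return [[a[j][i].conjugate() for j in range(len(a))]
            for i in range(len(a[0]))]

def apply_channel(kraus_ops, rho):
    """Return the sum over k of M_k rho M_k^dagger."""
    n = len(rho)
    out = [[0.0] * n for _ in range(n)]
    for m in kraus_ops:
        term = matmul(matmul(m, rho), dagger(m))
        out = [[out[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return out

# cond[y][x] plays the role of p_{Y|X}(y|x); each column sums to one.
cond = [[0.9, 0.2],
        [0.1, 0.8]]
p_x = [0.3, 0.7]

# Kraus operators sqrt(p(y|x)) |y><x|, one for every pair (x, y).
kraus_ops = []
for y in range(2):
    for x in range(2):
        op = [[0.0, 0.0], [0.0, 0.0]]
        op[y][x] = math.sqrt(cond[y][x])
        kraus_ops.append(op)

rho = [[p_x[0], 0.0], [0.0, p_x[1]]]          # the classical input state (4.179)
rho_out = apply_channel(kraus_ops, rho)

# Classical output distribution, as in (4.185).
p_y = [sum(cond[y][x] * p_x[x] for x in range(2)) for y in range(2)]
```

The output density operator is diagonal, and its diagonal entries match `p_y`.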

4.4.3 Noisy Evolution from a Unitary Interaction 

There is another perspective on quantum noise that is useful to consider. It is equivalent
to the perspective given in Chapter 5 when we discuss isometric evolution. Suppose that a
quantum system $A$ begins in the state $\rho^A$ and that there is an environment system $E$ in a
pure state $|0\rangle^E$. So the initial state of the joint system $AE$ is

\[ \rho^A \otimes |0\rangle\langle 0|^E. \tag{4.186} \]

Suppose that these two systems interact according to some unitary operator $U^{AE}$ acting on
the tensor-product space of $A$ and $E$. If we are only interested in the state $\sigma^A$ of the system
$A$ after the interaction, then we find it by taking the partial trace over the environment $E$:

\[ \sigma^A = \mathrm{Tr}_E\{U^{AE} (\rho^A \otimes |0\rangle\langle 0|^E) (U^{AE})^\dagger\}. \tag{4.187} \]

This evolution is equivalent to that of a completely positive, trace-preserving map with Kraus
operators $\{B_i = \langle i|^E U^{AE} |0\rangle^E\}_i$. This follows easily because we can take the partial trace
with respect to an orthonormal basis $\{|i\rangle^E\}$ for the environment:

\begin{align}
\mathrm{Tr}_E\{U^{AE} (\rho^A \otimes |0\rangle\langle 0|^E) (U^{AE})^\dagger\} &= \sum_i \langle i|^E U^{AE} (\rho^A \otimes |0\rangle\langle 0|^E) (U^{AE})^\dagger |i\rangle^E \tag{4.188} \\
&= \sum_i \langle i|^E U^{AE} |0\rangle^E \, \rho^A \, \langle 0|^E (U^{AE})^\dagger |i\rangle^E \tag{4.189} \\
&= \sum_i B_i \rho^A B_i^\dagger. \tag{4.190}
\end{align}

That the operators $\{B_i\}$ are a legitimate set of Kraus operators satisfying $\sum_i B_i^\dagger B_i = I^A$
follows from the unitarity of $U^{AE}$ and the orthonormality of the basis $\{|i\rangle^E\}$:

\begin{align}
\sum_i B_i^\dagger B_i &= \sum_i \langle 0|^E (U^{AE})^\dagger |i\rangle^E \langle i|^E U^{AE} |0\rangle^E \tag{4.191} \\
&= \langle 0|^E (U^{AE})^\dagger \left( \sum_i |i\rangle\langle i|^E \right) U^{AE} |0\rangle^E \tag{4.192} \\
&= \langle 0|^E (U^{AE})^\dagger U^{AE} |0\rangle^E \tag{4.193} \\
&= \langle 0|^E \, I^A \otimes I^E \, |0\rangle^E \tag{4.194} \\
&= I^A. \tag{4.195}
\end{align}
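
To see the construction $B_i = \langle i|^E U^{AE} |0\rangle^E$ at work, the sketch below takes $U^{AE}$ to be a CNOT gate with system $A$ as control and the environment as target. This choice of unitary is an assumption made for illustration; the joint basis is ordered with $A$ first, so index `a * dim_e + e` corresponds to $|a\rangle^A \otimes |e\rangle^E$. The code extracts the Kraus operators as blocks of $U^{AE}$ and compares $\sum_i B_i \rho^A B_i^\dagger$ with the partial trace in (4.187).

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def dagger(a):
    """Conjugate transpose."""
    return [[a[j][i].conjugate() for j in range(len(a))]
            for i in range(len(a[0]))]

def kron(a, b):
    """Tensor (Kronecker) product of two matrices."""
    return [[a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

def partial_trace_second(rho, dim_first, dim_second):
    """Trace out the second tensor factor."""
    return [[sum(rho[i * dim_second + j][k * dim_second + j]
                 for j in range(dim_second))
             for k in range(dim_first)] for i in range(dim_first)]

dim_a, dim_e = 2, 2
cnot = [                      # assumed interaction: A controls, E is the target
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]

# B_i = <i|^E U |0>^E: entry (a, a2) of B_i is U[(a, i), (a2, 0)].
kraus_ops = [[[cnot[a * dim_e + i][a2 * dim_e + 0] for a2 in range(dim_a)]
              for a in range(dim_a)] for i in range(dim_e)]

rho_a = [[0.5, 0.5], [0.5, 0.5]]       # |+><+| on system A
env = [[1.0, 0.0], [0.0, 0.0]]         # |0><0| on the environment

# Direct route (4.187): evolve jointly, then trace out the environment.
joint = matmul(matmul(cnot, kron(rho_a, env)), dagger(cnot))
sigma_direct = partial_trace_second(joint, dim_a, dim_e)

# Kraus route (4.190): sum of B_i rho B_i^dagger.
terms = [matmul(matmul(b, rho_a), dagger(b)) for b in kraus_ops]
sigma_kraus = [[sum(t[i][j] for t in terms) for j in range(dim_a)]
               for i in range(dim_a)]
```

For this unitary the Kraus operators are $|0\rangle\langle 0|$ and $|1\rangle\langle 1|$, and both routes give the maximally mixed state: the interaction removes the coherences of $|+\rangle\langle +|$.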

4.4.4 Unique Specification of a Noisy Channel 

Consider a given noisy quantum channel $\mathcal{N}$ with Kraus representation

\[ \mathcal{N}(\rho) = \sum_j A_j \rho A_j^\dagger. \tag{4.196} \]

We can also uniquely specify $\mathcal{N}$ by its action on an operator of the form $|i\rangle\langle j|$ where $\{|i\rangle\}$
is some orthonormal basis:

\[ N_{ij} = \mathcal{N}(|i\rangle\langle j|). \tag{4.197} \]

We can figure out how the channel $\mathcal{N}$ would act on any density operator if we know how it
acts on $|i\rangle\langle j|$ for all $i$ and $j$. Thus, two channels $\mathcal{N}$ and $\mathcal{M}$ are equivalent if they have the
same effect on all operators of the form $|i\rangle\langle j|$:

\[ \mathcal{N} = \mathcal{M} \iff \forall i, j \quad \mathcal{N}(|i\rangle\langle j|) = \mathcal{M}(|i\rangle\langle j|). \tag{4.198} \]

Let us now consider a maximally entangled qudit state $|\Phi\rangle^{AB}$ where

\[ |\Phi\rangle^{AB} = \frac{1}{\sqrt{d}} \sum_{i=0}^{d-1} |i\rangle^A |i\rangle^B, \tag{4.199} \]

and $d$ is the dimension of each system $A$ and $B$. The density operator $\Phi^{AB}$ of such a state
is as follows:

\[ \Phi^{AB} = \frac{1}{d} \sum_{i,j=0}^{d-1} |i\rangle\langle j|^A \otimes |i\rangle\langle j|^B. \tag{4.200} \]

Let us now send the $A$ system of $\Phi^{AB}$ through the noisy quantum channel $\mathcal{N}$:

\[ (\mathcal{N}^A \otimes I^B)(\Phi^{AB}) = \frac{1}{d} \sum_{i,j=0}^{d-1} \mathcal{N}^A(|i\rangle\langle j|^A) \otimes |i\rangle\langle j|^B. \tag{4.201} \]

The resulting state completely characterizes the noisy channel $\mathcal{N}$ because the following
map translates between the state in (4.201) and the operators $N_{ij}$ in (4.197):

\[ d\, \langle i'|^B (\mathcal{N}^A \otimes I^B)(\Phi^{AB}) |j'\rangle^B = N_{i'j'}. \tag{4.202} \]

Thus, we can completely characterize a noisy map by determining the quantum state
resulting from sending half of a maximally entangled state through it, and the following condition
is sufficient for any two noisy channels to be equivalent:

\[ \mathcal{N} = \mathcal{M} \iff (\mathcal{N}^A \otimes I^B)(\Phi^{AB}) = (\mathcal{M}^A \otimes I^B)(\Phi^{AB}). \tag{4.203} \]

It is equivalent to the condition in (4.198).



4.4.5 Concatenation of Noisy Maps 

A quantum state may undergo not just one type of noisy evolution; it can of course undergo one noisy quantum channel followed by another noisy quantum channel. Let $\mathcal{N}_1$ denote a first noisy evolution and let $\mathcal{N}_2$ denote a second noisy evolution. Suppose that the Kraus operators of $\mathcal{N}_1$ are $\{A_k\}$ and the Kraus operators of $\mathcal{N}_2$ are $\{B_k\}$. It is straightforward to define the concatenation $\mathcal{N}_2 \circ \mathcal{N}_1$ of these two maps. Consider that the output of the first map is

$$\mathcal{N}_1(\rho) = \sum_k A_k \rho A_k^\dagger \tag{4.204}$$

for some input density operator $\rho$. The output of the concatenation map $\mathcal{N}_2 \circ \mathcal{N}_1$ is then

$$(\mathcal{N}_2 \circ \mathcal{N}_1)(\rho) = \sum_k B_k \mathcal{N}_1(\rho) B_k^\dagger = \sum_{k,k'} B_k A_{k'} \rho A_{k'}^\dagger B_k^\dagger. \tag{4.205}$$

It is clear that the Kraus operators of the concatenation map are $\{B_k A_{k'}\}_{k,k'}$.
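
As a quick numerical illustration of (4.205), the sketch below (assuming NumPy; the particular bit flip and phase flip channels, their parameters, and the helper names are illustrative choices, not part of the text) builds the product operators $B_k A_{k'}$ and compares applying them in one shot against applying the two channels one after the other.

```python
import numpy as np

def apply_channel(kraus_ops, rho):
    """Apply N(rho) = sum_k A_k rho A_k^dagger."""
    return sum(A @ rho @ A.conj().T for A in kraus_ops)

def concatenate(kraus_first, kraus_second):
    """Kraus operators {B_k A_k'} of the map (second after first)."""
    return [B @ A for B in kraus_second for A in kraus_first]

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.diag([1.0, -1.0])

p, q = 0.1, 0.3  # assumed noise parameters
kraus_1 = [np.sqrt(1 - p) * I2, np.sqrt(p) * X]  # first map: bit flip
kraus_2 = [np.sqrt(1 - q) * I2, np.sqrt(q) * Z]  # second map: phase flip
kraus_12 = concatenate(kraus_1, kraus_2)

rho = np.array([[0.6, 0.2 - 0.1j], [0.2 + 0.1j, 0.4]])
step_by_step = apply_channel(kraus_2, apply_channel(kraus_1, rho))
one_shot = apply_channel(kraus_12, rho)

# The product operators should again satisfy the completeness relation.
completeness = sum(K.conj().T @ K for K in kraus_12)
print(np.allclose(step_by_step, one_shot), np.allclose(completeness, I2))
```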

4.4.6 Important Examples of Noisy Evolution 

This section discusses some of the most important examples of noisy evolutions that we will consider in this book. Throughout this book, we will be considering the information-carrying ability of these various channels. They will provide some useful, "hands on" insight into quantum Shannon theory.

Dephasing Channel

We have already given the example of a noisy quantum bit flip channel in Section 4.4.1. Another important example is a bit flip in the conjugate basis, or equivalently, a phase flip channel. This channel acts as follows on any given density operator:

$$\rho \to (1-p)\rho + p Z \rho Z. \tag{4.206}$$

It is also known as the dephasing channel.

For $p = 1/2$, the action of the dephasing channel on a given quantum state is equivalent to the action of measuring the qubit in the computational basis and forgetting the result of the measurement. We make this idea clearer with an example. First, suppose that we have a qubit

$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle, \tag{4.207}$$

and we measure it in the computational basis. Then the postulates of quantum theory state that the qubit becomes $|0\rangle$ with probability $|\alpha|^2$ and it becomes $|1\rangle$ with probability $|\beta|^2$. Suppose that we forget the measurement outcome, or alternatively, we do not have access to it. Then our best description of the qubit is with the following ensemble:

$$\{\{|\alpha|^2, |0\rangle\}, \{|\beta|^2, |1\rangle\}\}. \tag{4.208}$$

The density operator of this ensemble is

$$|\alpha|^2 |0\rangle\langle 0| + |\beta|^2 |1\rangle\langle 1|. \tag{4.209}$$

Now let us check if the dephasing channel gives the same behavior as the forgetful measurement above. We can consider the qubit as being an ensemble $\{1, |\psi\rangle\}$, i.e., the state is certain to be $|\psi\rangle$. The density operator of the ensemble is then $\rho$ where

$$\rho = |\alpha|^2 |0\rangle\langle 0| + \alpha\beta^* |0\rangle\langle 1| + \alpha^*\beta |1\rangle\langle 0| + |\beta|^2 |1\rangle\langle 1|. \tag{4.210}$$

If we act on the density operator $\rho$ with the dephasing channel with $p = 1/2$, then it preserves the density operator with probability 1/2 and phase flips the qubit with probability 1/2:

\begin{align}
\frac{1}{2}\rho + \frac{1}{2} Z \rho Z
&= \frac{1}{2}\left(|\alpha|^2 |0\rangle\langle 0| + \alpha\beta^* |0\rangle\langle 1| + \alpha^*\beta |1\rangle\langle 0| + |\beta|^2 |1\rangle\langle 1|\right) \notag \\
&\quad + \frac{1}{2}\left(|\alpha|^2 |0\rangle\langle 0| - \alpha\beta^* |0\rangle\langle 1| - \alpha^*\beta |1\rangle\langle 0| + |\beta|^2 |1\rangle\langle 1|\right) \tag{4.211} \\
&= |\alpha|^2 |0\rangle\langle 0| + |\beta|^2 |1\rangle\langle 1|. \tag{4.212}
\end{align}

The dephasing channel nullifies the off-diagonal terms in the density operator with respect to the computational basis. The resulting density operator description is the same as what we found for the forgetful measurement.

Exercise 4.4.4 Verify that the action of the dephasing channel on the Bloch vector is

$$\frac{1}{2}\left(I + (1-2p) r_x X + (1-2p) r_y Y + r_z Z\right), \tag{4.213}$$

so that the channel preserves any component of the Bloch vector in the $Z$ direction, while shrinking any component in the $X$ or $Y$ direction.
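
The equivalence between the $p = 1/2$ dephasing channel and the forgetful measurement in (4.209) through (4.212) can also be confirmed numerically. This is a small sketch assuming NumPy; the amplitudes $\alpha = 0.6$ and $\beta = 0.8i$ and the function names are arbitrary illustrative choices.

```python
import numpy as np

Z = np.diag([1.0, -1.0])
P0 = np.diag([1.0, 0.0])  # projector |0><0|
P1 = np.diag([0.0, 1.0])  # projector |1><1|

def dephase(rho, p):
    """Dephasing channel (4.206): rho -> (1 - p) rho + p Z rho Z."""
    return (1 - p) * rho + p * (Z @ rho @ Z)

def measure_and_forget(rho):
    """Measure in the computational basis and discard the outcome."""
    return P0 @ rho @ P0 + P1 @ rho @ P1

alpha, beta = 0.6, 0.8j  # assumed amplitudes, |alpha|^2 + |beta|^2 = 1
psi = np.array([alpha, beta])
rho = np.outer(psi, psi.conj())

dephased = dephase(rho, 0.5)
forgotten = measure_and_forget(rho)
print(np.allclose(dephased, forgotten))
```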

Pauli Channel 

A Pauli channel is a generalization of the above dephasing channel and the bit flip channel. It simply applies a random Pauli operator according to a probability distribution. The map for a qubit Pauli channel is

$$\rho \to \sum_{i,j=0}^{1} p(i,j) \, Z^i X^j \rho X^j Z^i. \tag{4.214}$$

The generalization of this channel to qudits is straightforward. We simply replace the Pauli operators with the Heisenberg-Weyl operators. The Pauli qudit channel is

$$\rho \to \sum_{i,j=0}^{d-1} p(i,j) \, Z(i) X(j) \rho X^\dagger(j) Z^\dagger(i). \tag{4.215}$$

These channels are important in the study of quantum key distribution (QKD) because an eavesdropper induces such a channel in a QKD protocol.
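
A direct transcription of the qubit map (4.214) into code can make the indexing concrete. The sketch below assumes NumPy; the function name `pauli_channel` and the convention that `probs[i][j]` stores $p(i,j)$ are illustrative assumptions. As a sanity check, putting weight $p$ on $Z^1 X^0 = Z$ should reproduce the dephasing channel.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def pauli_channel(rho, probs):
    """Qubit Pauli channel (4.214); probs[i][j] is the weight p(i, j) of Z^i X^j."""
    out = np.zeros((2, 2), dtype=complex)
    for i in range(2):
        for j in range(2):
            P = np.linalg.matrix_power(Z, i) @ np.linalg.matrix_power(X, j)
            out += probs[i][j] * (P @ rho @ P.conj().T)
    return out

rho = np.array([[0.6, 0.2 - 0.1j], [0.2 + 0.1j, 0.4]])

p = 0.25  # assumed phase flip probability
dephasing_probs = [[1 - p, 0.0], [p, 0.0]]  # weight p on Z, weight 1 - p on I
via_pauli = pauli_channel(rho, dephasing_probs)
direct = (1 - p) * rho + p * (Z @ rho @ Z)
print(np.allclose(via_pauli, direct))
```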

Exercise 4.4.5 We can write a Pauli channel as

$$\rho \to p_I \rho + p_X X \rho X + p_Y Y \rho Y + p_Z Z \rho Z. \tag{4.216}$$

Verify that the action of the Pauli channel on the Bloch vector is

$$\left((p_I + p_X - p_Y - p_Z) r_x, \; (p_I + p_Y - p_X - p_Z) r_y, \; (p_I + p_Z - p_X - p_Y) r_z\right). \tag{4.217}$$

Depolarizing Channel 

The depolarizing channel is a "worst-case scenario" channel. It assumes that we just completely lose the input qubit with some probability, i.e., it replaces the lost qubit with the maximally mixed state. The map for the depolarizing channel is

$$\rho \to (1-p)\rho + p\pi, \tag{4.218}$$

where $\pi$ is the maximally mixed state: $\pi = I/2$.

Most of the time, this channel is too pessimistic. Usually, we can learn something about the physical nature of the channel by some estimation process. We should only consider using the depolarizing channel as a model if we have little to no information about the actual physical channel.
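
For reference, the map (4.218) is a one-liner in code. This sketch assumes NumPy and a normalized input state; the function name `depolarize` is hypothetical. It only checks the two extreme cases and trace preservation.

```python
import numpy as np

def depolarize(rho, p):
    """Depolarizing channel (4.218), assuming Tr(rho) = 1."""
    d = rho.shape[0]
    pi = np.eye(d) / d  # maximally mixed state
    return (1 - p) * rho + p * pi

rho = np.array([[0.6, 0.2 - 0.1j], [0.2 + 0.1j, 0.4]])
untouched = depolarize(rho, 0.0)      # p = 0 leaves the state alone
fully_lost = depolarize(rho, 1.0)     # p = 1 always outputs pi
partially_lost = depolarize(rho, 0.3)
print(np.trace(partially_lost).real)
```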

Exercise 4.4.6 (Pauli Twirl) Show that randomly applying the Pauli operators $I$, $X$, $Y$, $Z$ with uniform probability to any density operator gives the maximally mixed state:

$$\frac{1}{4}\rho + \frac{1}{4} X \rho X + \frac{1}{4} Y \rho Y + \frac{1}{4} Z \rho Z = \pi. \tag{4.219}$$

(Hint: Represent the density operator as $\rho = (I + r_x X + r_y Y + r_z Z)/2$ and apply the commutation rules of the Pauli operators.) This is known as the "twirling" operation.

Exercise 4.4.7 Show that we can rewrite the depolarizing channel as the following Pauli channel:

$$\rho \to (1 - 3p/4)\rho + p\left(\frac{1}{4} X \rho X + \frac{1}{4} Y \rho Y + \frac{1}{4} Z \rho Z\right). \tag{4.220}$$

Exercise 4.4.8 Show that the action of a depolarizing channel on the Bloch vector is

$$(r_x, r_y, r_z) \to \left((1-p) r_x, \; (1-p) r_y, \; (1-p) r_z\right). \tag{4.221}$$

Thus, it uniformly shrinks the Bloch vector to become closer to the maximally mixed state.

The generalization of the depolarizing channel to qudits is again straightforward. It is the same as the map in (4.218), with the exception that the density operators $\rho$ and $\pi$ are qudit density operators.

Exercise 4.4.9 (Qudit Twirl) Show that randomly applying the Heisenberg-Weyl operators

$$\{X(i) Z(j)\}_{i,j \in \{0, \ldots, d-1\}} \tag{4.222}$$

with uniform probability to any qudit density operator gives the maximally mixed state $\pi$:

$$\frac{1}{d^2} \sum_{i,j=0}^{d-1} X(i) Z(j) \rho Z^\dagger(j) X^\dagger(i) = \pi. \tag{4.223}$$

(Hint: You can do the full calculation, or you can decompose this channel into the composition of two completely dephasing channels where the first is a dephasing in the computational basis and the next is a dephasing in the conjugate basis.)

Amplitude Damping Channel

The amplitude damping channel is a first-order approximation to a noisy evolution that occurs in many physical systems ranging from optical systems to chains of spin-1/2 particles to spontaneous emission of a photon from an atom.

In order to motivate this channel, we give a physical interpretation to our computational basis states. Let us think of the $|0\rangle$ state as the ground state of a two-level atom and let us think of the state $|1\rangle$ as the excited state of the atom. Spontaneous emission is a process that tends to decay the atom from its excited state to its ground state, even if the atom is in a superposition of the ground and excited states. Let the parameter $\gamma$ denote the probability of decay so that $0 \le \gamma \le 1$. One Kraus operator that captures the decaying behavior is

$$A_0 = \sqrt{\gamma} \, |0\rangle\langle 1|. \tag{4.224}$$

The operator $A_0$ annihilates the ground state:

$$A_0 |0\rangle\langle 0| A_0^\dagger = 0, \tag{4.225}$$

and it decays the excited state to the ground state:

$$A_0 |1\rangle\langle 1| A_0^\dagger = \gamma |0\rangle\langle 0|. \tag{4.226}$$

The Kraus operator $A_0$ alone does not specify a physical map because $A_0^\dagger A_0 = \gamma |1\rangle\langle 1|$ (recall that the Kraus operators of any channel should satisfy the condition $\sum_k A_k^\dagger A_k = I$). We can satisfy this condition by choosing another operator $A_1$ such that

\begin{align}
A_1^\dagger A_1 &= I - A_0^\dagger A_0 \tag{4.227} \\
&= |0\rangle\langle 0| + (1-\gamma) |1\rangle\langle 1|. \tag{4.228}
\end{align}

The following choice of $A_1$ satisfies the above condition:

$$A_1 \equiv |0\rangle\langle 0| + \sqrt{1-\gamma} \, |1\rangle\langle 1|. \tag{4.229}$$

Thus, the operators $A_0$ and $A_1$ are valid Kraus operators for the amplitude damping channel.
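
The construction of $A_0$ and $A_1$ above is easy to verify with explicit matrices. The following sketch assumes NumPy and an arbitrary illustrative value $\gamma = 0.3$; the helper name is hypothetical. It checks the completeness relation and shows what the channel does to the excited state.

```python
import numpy as np

def amplitude_damping_kraus(gamma):
    """Kraus operators (4.224) and (4.229) in the computational basis."""
    A0 = np.sqrt(gamma) * np.array([[0.0, 1.0], [0.0, 0.0]])  # sqrt(gamma)|0><1|
    A1 = np.array([[1.0, 0.0], [0.0, np.sqrt(1 - gamma)]])    # |0><0| + sqrt(1-gamma)|1><1|
    return A0, A1

gamma = 0.3  # assumed decay probability
A0, A1 = amplitude_damping_kraus(gamma)

# Completeness relation: A0^dagger A0 + A1^dagger A1 should be the identity.
completeness = A0.conj().T @ A0 + A1.conj().T @ A1

# The excited state |1><1| decays to the ground state with probability gamma.
excited = np.diag([0.0, 1.0])
decayed = A0 @ excited @ A0.conj().T + A1 @ excited @ A1.conj().T
print(np.round(decayed, 3))
```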

Exercise 4.4.10 Consider a single-qubit density operator with the following matrix representation with respect to the computational basis:

$$\rho = \begin{bmatrix} 1-p & \eta \\ \eta^* & p \end{bmatrix}, \tag{4.230}$$

where $0 \le p \le 1$ and $\eta$ is some complex number. Show that applying the amplitude damping channel with parameter $\gamma$ to a qubit with the above density operator gives a density operator with the following matrix representation:

$$\begin{bmatrix} 1-(1-\gamma)p & \sqrt{1-\gamma} \, \eta \\ \sqrt{1-\gamma} \, \eta^* & (1-\gamma)p \end{bmatrix}. \tag{4.231}$$

Exercise 4.4.11 Show that the amplitude damping channel obeys a composition rule. Consider an amplitude damping channel $\mathcal{N}_1$ with transmission parameter $(1-\gamma_1)$ and consider another amplitude damping channel $\mathcal{N}_2$ with transmission parameter $(1-\gamma_2)$. Show that the composition channel $\mathcal{N}_2 \circ \mathcal{N}_1$ is an amplitude damping channel with transmission parameter $(1-\gamma_1)(1-\gamma_2)$. (Note that the transmission parameter is equal to one minus the damping parameter.)

Erasure Channel 

The erasure channel is another important channel in quantum Shannon theory. It admits a 
simple model and is amenable to relatively straightforward analysis when we later discuss 
its capacity. The erasure channel can serve as a simplified model of photon loss in optical 
systems. 

We first recall the classical definition of an erasure channel. A classical erasure channel either transmits a bit with some probability $1-\varepsilon$ or replaces it with an erasure symbol $e$ with some probability $\varepsilon$. The output alphabet contains one more symbol than the input alphabet, namely, the erasure symbol $e$.

The generalization of the classical erasure channel to the quantum world is straightforward. It implements the following map:

$$\rho \to (1-\varepsilon)\rho + \varepsilon |e\rangle\langle e|, \tag{4.232}$$

where $|e\rangle$ is some state that is not in the input Hilbert space, and thus is orthogonal to it. The output space of the erasure channel is larger than its input space by one dimension. The interpretation of the quantum erasure channel is similar to that for the classical erasure channel. It transmits a qubit with probability $1-\varepsilon$ and "erases" it (replaces it with an orthogonal erasure state) with probability $\varepsilon$.

Exercise 4.4.12 Show that the following operators are the Kraus operators for the quantum erasure channel:

$$\sqrt{1-\varepsilon} \left(|0\rangle^{B}\langle 0|^{A} + |1\rangle^{B}\langle 1|^{A}\right), \tag{4.233}$$

$$\sqrt{\varepsilon} \, |e\rangle^{B}\langle 0|^{A}, \tag{4.234}$$

$$\sqrt{\varepsilon} \, |e\rangle^{B}\langle 1|^{A}. \tag{4.235}$$

At the receiving end of the channel, a simple measurement can determine whether an erasure has occurred. We perform a measurement with measurement operators $\{\Pi_{\text{in}}, |e\rangle\langle e|\}$, where $\Pi_{\text{in}}$ is the projector onto the input Hilbert space. This measurement has the benefit of detecting no more information than necessary. It merely detects whether an erasure occurs, and thus preserves the quantum information at the input if an erasure does not occur.
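
The erasure-detecting measurement can be illustrated with small matrices. In the sketch below (assuming NumPy), the output space is ordered as $|0\rangle, |1\rangle, |e\rangle$, so the Kraus operators (4.233) to (4.235) become $3 \times 2$ matrices; this ordering, the value $\varepsilon = 0.2$, and the variable names are assumptions made for illustration.

```python
import numpy as np

eps = 0.2  # assumed erasure probability

# Output basis ordering (assumed): |0>, |1>, |e>. Each Kraus operator is 3 x 2.
K_keep = np.sqrt(1 - eps) * np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
K_erase0 = np.sqrt(eps) * np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]])  # |e><0|
K_erase1 = np.sqrt(eps) * np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 1.0]])  # |e><1|
kraus = [K_keep, K_erase0, K_erase1]

rho = np.array([[0.6, 0.2 - 0.1j], [0.2 + 0.1j, 0.4]])
out = sum(K @ rho @ K.conj().T for K in kraus)

# Measurement {Pi_in, |e><e|}: Pi_in projects onto span{|0>, |1>}.
Pi_in = np.diag([1.0, 1.0, 0.0])
prob_erasure = out[2, 2].real
prob_no_erasure = np.trace(Pi_in @ out).real
state_if_no_erasure = (Pi_in @ out @ Pi_in)[:2, :2] / prob_no_erasure
print(prob_erasure, np.allclose(state_if_no_erasure, rho))
```

When no erasure is detected, the conditional state on the input subspace matches the input, which is the sense in which the measurement reveals no more than necessary.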

Classical-Quantum Channel 

A classical-quantum channel is one that first measures the input state in a particular orthonormal basis and outputs a density operator conditional on the result of the measurement.

Figure 4.3: The above figure illustrates the internal workings of a classical-quantum channel. It first measures the input state in some basis $\{|k\rangle\}$ and outputs a quantum state $\sigma_k$ conditional on the measurement outcome.



Suppose that the input to the channel is a density operator $\rho$. Suppose that $\{|k\rangle\}_k$ is an orthonormal basis for the Hilbert space on which the density operator $\rho$ acts. The classical-quantum channel first measures the input state in the basis $\{|k\rangle\}$. Given that the result of the measurement is $k$, the post-measurement state is

$$\frac{|k\rangle\langle k| \rho |k\rangle\langle k|}{\langle k|\rho|k\rangle}. \tag{4.236}$$

The classical-quantum channel correlates a density operator $\sigma_k$ with the post-measurement state $k$:

$$\frac{|k\rangle\langle k| \rho |k\rangle\langle k|}{\langle k|\rho|k\rangle} \otimes \sigma_k. \tag{4.237}$$

This action leads to an ensemble:

$$\left\{ \langle k|\rho|k\rangle, \; \frac{|k\rangle\langle k| \rho |k\rangle\langle k|}{\langle k|\rho|k\rangle} \otimes \sigma_k \right\}, \tag{4.238}$$

and the density operator of the ensemble is

$$\sum_k \langle k|\rho|k\rangle \, \frac{|k\rangle\langle k| \rho |k\rangle\langle k|}{\langle k|\rho|k\rangle} \otimes \sigma_k = \sum_k |k\rangle\langle k| \rho |k\rangle\langle k| \otimes \sigma_k. \tag{4.239}$$

The channel then only outputs the system on the right (tracing out the first system) so that the resulting channel is as follows:

$$\mathcal{N}(\rho) \equiv \sum_k \langle k|\rho|k\rangle \, \sigma_k. \tag{4.240}$$

Figure 4.3 depicts the behavior of the classical-quantum channel. This channel is also known as an entanglement-breaking channel, for reasons that become clear in the next exercise.
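
The final form (4.240) depends on the input only through the diagonal entries $\langle k|\rho|k\rangle$. The sketch below (assuming NumPy, the computational basis as the measurement basis, and two arbitrary illustrative output states $\sigma_0$ and $\sigma_1$) shows that a coherent superposition and the corresponding incoherent mixture lead to the same output.

```python
import numpy as np

def cq_channel(rho, sigmas):
    """Classical-quantum channel (4.240), measuring in the computational basis."""
    return sum(rho[k, k].real * sigma for k, sigma in enumerate(sigmas))

plus = np.array([1.0, 1.0]) / np.sqrt(2)
sigma_0 = np.outer(plus, plus)  # assumed output for outcome 0: |+><+|
sigma_1 = np.eye(2) / 2         # assumed output for outcome 1: maximally mixed
sigmas = [sigma_0, sigma_1]

coherent_input = np.outer(plus, plus)   # superposition of |0> and |1>
incoherent_input = np.diag([0.5, 0.5])  # mixture with the same diagonal

out_coherent = cq_channel(coherent_input, sigmas)
out_incoherent = cq_channel(incoherent_input, sigmas)
print(np.allclose(out_coherent, out_incoherent))
```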

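Exercise 4.4.13 Show that the classical-quantum channel is an entanglement-breaking channel; that is, if we input the $B$ system of an entangled state $\psi^{AB}$, then the resulting state on $AB$ is no longer entangled.

We can prove a more general structural theorem regarding entanglement-breaking channels by exploiting the observation in the above exercise.

Theorem 4.4.1. An entanglement-breaking channel has a representation with Kraus operators that are unit rank.

Proof. Consider that the output of an entanglement-breaking channel $\mathcal{N}_{\text{EB}}$ acting on half of a maximally entangled state is as follows:

$$\mathcal{N}_{\text{EB}}^{A \to B'}(\Phi^{BA}) = \sum_z p_Z(z) \, |\phi_z\rangle\langle\phi_z|^{B} \otimes |\psi_z\rangle\langle\psi_z|^{B'}. \tag{4.241}$$

This holds because the output of the channel is a separable state (it "breaks" entanglement), and it is always possible to find a representation of the separable state with pure states (see Exercise 4.3.3). Now consider constructing a channel $\mathcal{M}$ with the following unit-rank Kraus operators:

$$\left\{ A_z \equiv \sqrt{d \, p_Z(z)} \, |\psi_z\rangle\langle\phi_z^*| \right\}_z, \tag{4.242}$$

where $|\phi_z^*\rangle$ is the state $|\phi_z\rangle$ with all of its elements conjugated. We should first verify that these Kraus operators form a valid channel, by checking that $\sum_z A_z^\dagger A_z = I$:

\begin{align}
\sum_z A_z^\dagger A_z &= \sum_z d \, p_Z(z) \, |\phi_z^*\rangle\langle\psi_z|\psi_z\rangle\langle\phi_z^*| \tag{4.243} \\
&= d \sum_z p_Z(z) \, |\phi_z^*\rangle\langle\phi_z^*|. \tag{4.244}
\end{align}

Consider that

\begin{align}
\text{Tr}_{B'}\left\{ \mathcal{N}_{\text{EB}}^{A \to B'}(\Phi^{BA}) \right\} &= \pi^{B} \tag{4.245} \\
&= \text{Tr}_{B'}\left\{ \sum_z p_Z(z) \, |\phi_z\rangle\langle\phi_z|^{B} \otimes |\psi_z\rangle\langle\psi_z|^{B'} \right\} \tag{4.246} \\
&= \sum_z p_Z(z) \, |\phi_z\rangle\langle\phi_z|^{B}, \tag{4.247}
\end{align}

where $\pi^{B}$ is the maximally mixed state. Thus, it follows that $\mathcal{M}$ is a valid quantum channel because

\begin{align}
d \sum_z p_Z(z) \, |\phi_z\rangle\langle\phi_z|^{B} &= d \, \pi^{B} \tag{4.248} \\
&= I^{B} \tag{4.249} \\
&= (I^{B})^* \tag{4.250} \\
&= d \sum_z p_Z(z) \, |\phi_z^*\rangle\langle\phi_z^*|^{B} \tag{4.251} \\
&= \sum_z A_z^\dagger A_z. \tag{4.252}
\end{align}

Now let us consider the action of the channel $\mathcal{M}$ on the maximally entangled state:

\begin{align}
& \mathcal{M}^{A \to B'}(\Phi^{BA}) \tag{4.253} \\
&= \frac{1}{d} \sum_{z,i,j} |i\rangle\langle j|^{B} \otimes \sqrt{d \, p_Z(z)} \, |\psi_z\rangle\langle\phi_z^*|i\rangle\langle j|\phi_z^*\rangle\langle\psi_z|^{B'} \sqrt{d \, p_Z(z)} \tag{4.254} \\
&= \sum_{z,i,j} p_Z(z) \, |i\rangle\langle j|^{B} \otimes \langle\phi_z^*|i\rangle\langle j|\phi_z^*\rangle \, |\psi_z\rangle\langle\psi_z|^{B'} \tag{4.255} \\
&= \sum_{z,i,j} p_Z(z) \, |i\rangle\langle j|\phi_z^*\rangle\langle\phi_z^*|i\rangle\langle j|^{B} \otimes |\psi_z\rangle\langle\psi_z|^{B'} \tag{4.256} \\
&= \sum_z p_Z(z) \, |\phi_z\rangle\langle\phi_z|^{B} \otimes |\psi_z\rangle\langle\psi_z|^{B'}. \tag{4.257}
\end{align}

The last equality follows from recognizing $\sum_{i,j} |i\rangle\langle j| \cdot |i\rangle\langle j|$ as the transpose operation and noting that the transpose is equivalent to conjugation for a Hermitian operator $|\phi_z\rangle\langle\phi_z|$. Finally, since the action of both $\mathcal{N}_{\text{EB}}^{A \to B'}$ and $\mathcal{M}^{A \to B'}$ on the maximally entangled state is the same, we can conclude that the two channels are equivalent (see Section 4.4.4). Thus, $\mathcal{M}$ is a representation of the channel with unit-rank Kraus operators. □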



4.4.7 Quantum Instrument 

The description of a quantum channel with Kraus operators gives the most general evolution 
that a quantum state can undergo. We may want to specialize this definition somewhat for 
another scenario. Suppose that we would like to determine the most general evolution where 
the input is a quantum state and the output is both a quantum state and a classical variable. 
Such a scenario may arise in a case where Alice is trying to transmit both classical and 
quantum information, and Bob exploits a quantum instrument to decode both the classical 
and quantum systems. A quantum instrument gives such an evolution with a hybrid output. 
Recall that we may view a noisy quantum channel as arising from the forgetting of a measurement outcome, as in (4.171). Let us now suppose that some third party performs a measurement with two outcomes $j$ and $k$, but does not give us access to the measurement outcome $k$. Suppose that the measurement operators for this two-outcome measurement are $\{M_{j,k}\}_{j,k}$. Let us first suppose that the third party performs the measurement on a quantum system with density operator $\rho$ and gives us both of the measurement outcomes. The post-measurement state in such a scenario is

$$\frac{M_{j,k} \rho M_{j,k}^\dagger}{p_{J,K}(j,k)}, \tag{4.258}$$

where the joint distribution of outcomes $j$ and $k$ is

$$p_{J,K}(j,k) = \text{Tr}\{M_{j,k}^\dagger M_{j,k} \rho\}. \tag{4.259}$$

We can calculate the marginal distributions $p_J(j)$ and $p_K(k)$ according to the law of total probability:

\begin{align}
p_J(j) &= \sum_k p_{J,K}(j,k) = \sum_k \text{Tr}\{M_{j,k}^\dagger M_{j,k} \rho\}, \tag{4.260} \\
p_K(k) &= \sum_j p_{J,K}(j,k) = \sum_j \text{Tr}\{M_{j,k}^\dagger M_{j,k} \rho\}. \tag{4.261}
\end{align}

Suppose the measuring device also places the classical outcomes in classical registers $J$ and $K$, so that the post-measurement state is

$$\frac{M_{j,k} \rho M_{j,k}^\dagger}{p_{J,K}(j,k)} \otimes |j\rangle\langle j|^{J} \otimes |k\rangle\langle k|^{K}, \tag{4.262}$$

where the sets $\{|j\rangle\}$ and $\{|k\rangle\}$ form respective orthonormal bases. Such an operation is possible physically, and we could retrieve the classical information at some later point by performing a von Neumann measurement of the registers $J$ and $K$. If we would like to determine the Kraus map for the overall quantum operation, we simply take the expectation over all measurement outcomes $j$ and $k$:

$$\sum_{j,k} p_{J,K}(j,k) \left( \frac{M_{j,k} \rho M_{j,k}^\dagger}{p_{J,K}(j,k)} \right) \otimes |j\rangle\langle j|^{J} \otimes |k\rangle\langle k|^{K} = \sum_{j,k} M_{j,k} \rho M_{j,k}^\dagger \otimes |j\rangle\langle j|^{J} \otimes |k\rangle\langle k|^{K}. \tag{4.263}$$

Let us now suppose that we do not have access to the measurement result $k$. This lack of access is equivalent to lacking access to classical register $K$. To determine the resulting state, we should trace out the classical register $K$. Our map then becomes

$$\sum_{j,k} M_{j,k} \rho M_{j,k}^\dagger \otimes |j\rangle\langle j|^{J}. \tag{4.264}$$

The above map corresponds to a quantum instrument, and is a general noisy quantum evolution that produces both a quantum output and a classical output. Figure 4.4 depicts a quantum instrument.

We can rewrite the above map more explicitly as follows:

$$\sum_j \left( \sum_k M_{j,k} \rho M_{j,k}^\dagger \right) \otimes |j\rangle\langle j|^{J} = \sum_j \mathcal{E}_j(\rho) \otimes |j\rangle\langle j|^{J}, \tag{4.265}$$

where we define

$$\mathcal{E}_j(\rho) \equiv \sum_k M_{j,k} \rho M_{j,k}^\dagger. \tag{4.266}$$

Figure 4.4: The figure on the left illustrates a quantum instrument, a general noisy evolution that produces both a quantum and classical output. The figure on the right illustrates the internal workings of a quantum instrument, showing that it results from having only partial access to a measurement outcome.

Each $j$-dependent map $\mathcal{E}_j(\rho)$ is a completely positive trace-reducing map because $\text{Tr}\{\mathcal{E}_j(\rho)\} \le 1$. In fact, by examining the definition of $\mathcal{E}_j(\rho)$ and comparing to (4.260), it holds that

$$\text{Tr}\{\mathcal{E}_j(\rho)\} = p_J(j). \tag{4.267}$$

It is important to note that the probability $p_J(j)$ is dependent on the density operator $\rho$ that is input to the instrument. We can determine the quantum output of the instrument by tracing over the classical register $J$. The resulting quantum output is then

$$\text{Tr}_J\left\{ \sum_j \mathcal{E}_j(\rho) \otimes |j\rangle\langle j|^{J} \right\} = \sum_j \mathcal{E}_j(\rho). \tag{4.268}$$

The above "sum map" is a trace-preserving map because

\begin{align}
\text{Tr}\left\{ \sum_j \mathcal{E}_j(\rho) \right\} &= \sum_j \text{Tr}\{\mathcal{E}_j(\rho)\} \tag{4.269} \\
&= \sum_j p_J(j) \tag{4.270} \\
&= 1, \tag{4.271}
\end{align}

where the last equality follows because the marginal probabilities $p_J(j)$ sum to one. The above points that we have mentioned are the most salient for the quantum instrument. We will exploit this type of evolution when we require a device that outputs both a classical and quantum system.
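
A small numerical instance may help fix the roles of $j$ and $k$. The sketch below assumes NumPy and a hypothetical family of measurement operators $M_{j,k} = \sqrt{q_k} \, U_k P_j$, where $P_j$ are computational-basis projectors, $U_k \in \{I, X\}$, and $q = (0.7, 0.3)$; none of these choices come from the text. It checks (4.267) and the trace preservation of the sum map.

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
projectors = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]  # P_j, indexed by j
unitaries = [I2, X]                                       # U_k, indexed by k
weights = [0.7, 0.3]                                      # q_k (assumed)

# Hypothetical measurement operators M[j][k] = sqrt(q_k) U_k P_j.
M = [[np.sqrt(w) * U @ P for U, w in zip(unitaries, weights)] for P in projectors]

def instrument_branch(rho, j):
    """Trace-reducing map E_j(rho) = sum_k M_{j,k} rho M_{j,k}^dagger of (4.266)."""
    return sum(Mjk @ rho @ Mjk.conj().T for Mjk in M[j])

rho = np.array([[0.6, 0.2 - 0.1j], [0.2 + 0.1j, 0.4]])
branches = [instrument_branch(rho, j) for j in range(2)]
p_J = [np.trace(b).real for b in branches]  # equals Tr{E_j(rho)}, see (4.267)
sum_map_output = sum(branches)              # quantum output after discarding J
print(p_J, np.trace(sum_map_output).real)
```

With these choices the classical output $j$ has probabilities given by the diagonal of the input, so changing `rho` changes `p_J`, in line with the remark that $p_J(j)$ depends on the input state.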

Exercise 4.4.14 Suppose that you have a set of completely-positive trace-preserving maps $\{\mathcal{E}_m\}$. Design a quantum instrument by modifying these maps in any way that you wish.

We should stress that a quantum instrument is more general than applying a mixture of CPTP maps to a quantum state. Suppose that we apply a mixture $\{\mathcal{N}_j\}$ of CPTP maps to a quantum state $\rho$, chosen according to a distribution $p_J(j)$. The resulting expected state is as follows:

$$\sum_j p_J(j) \, |j\rangle\langle j|^{J} \otimes \mathcal{N}_j(\rho). \tag{4.272}$$

Figure 4.5: The figure on the left depicts a general operation, a conditional quantum encoder, that takes a classical system to a quantum system. The figure on the right depicts the inner workings of the conditional quantum encoder.

The probabilities $p_J(j)$ here are independent of the state $\rho$ that is input to the mixture of CPTP maps, but this is not generally the case for a quantum instrument. There, the probabilities $p_J(j)$ can depend on the state $\rho$ that is input. It may be beneficial then to write these probabilities as $p_J(j|\rho)$ because there is an implicit conditioning on the state that is input to the instrument.



4.4.8 Conditional Quantum Channel 

We end this chapter by considering one final type of evolution. A conditional quantum encoder $\mathcal{E}^{MA \to B}$, or conditional quantum channel, is a collection $\{\mathcal{E}_m^{A \to B}\}_m$ of CPTP maps. Its inputs are a classical system $M$ and a quantum system $A$ and its output is a quantum system $B$. A conditional quantum encoder can function as an encoder of both classical and quantum information.

A classical-quantum state $\rho^{MA}$, where

$$\rho^{MA} \equiv \sum_m p(m) \, |m\rangle\langle m|^{M} \otimes \rho_m^{A}, \tag{4.273}$$

can act as an input to a conditional quantum encoder $\mathcal{E}^{MA \to B}$. The action of the conditional quantum encoder $\mathcal{E}^{MA \to B}$ on the classical-quantum state $\rho^{MA}$ is as follows:

$$\mathcal{E}^{MA \to B}(\rho^{MA}) = \text{Tr}_M\left\{ \sum_m p(m) \, |m\rangle\langle m|^{M} \otimes \mathcal{E}_m^{A \to B}(\rho_m^{A}) \right\}. \tag{4.274}$$

Figure 4.5 depicts the behavior of the conditional quantum encoder.

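It is actually possible to write any quantum channel as a conditional quantum encoder when its input is a classical-quantum state. Indeed, consider any quantum channel $\mathcal{N}^{XA \to B}$ that has input systems $X$ and $A$ and output system $B$. Suppose the Kraus decomposition of this channel is as follows:

$$\mathcal{N}^{XA \to B}(\rho) \equiv \sum_j A_j \rho A_j^\dagger. \tag{4.275}$$

Suppose now that the input to the channel is the following classical-quantum state:

$$\sigma^{XA} \equiv \sum_x p_X(x) \, |x\rangle\langle x|^{X} \otimes \rho_x^{A}. \tag{4.276}$$

Then the channel $\mathcal{N}^{XA \to B}$ acts as follows on the classical-quantum state $\sigma^{XA}$:

$$\mathcal{N}^{XA \to B}(\sigma^{XA}) = \sum_j A_j \left( \sum_x p_X(x) \, |x\rangle\langle x|^{X} \otimes \rho_x^{A} \right) A_j^\dagger. \tag{4.277}$$

Consider that a classical-quantum state admits the following matrix representation by exploiting the tensor product:

\begin{align}
\sum_{x \in \mathcal{X}} p_X(x) \, |x\rangle\langle x|^{X} \otimes \rho_x^{A} \tag{4.278} \\
&= \begin{bmatrix}
p_X(x_1) \rho_{x_1}^{A} & 0 & \cdots & 0 \\
0 & p_X(x_2) \rho_{x_2}^{A} & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \cdots & 0 & p_X(x_{|\mathcal{X}|}) \rho_{x_{|\mathcal{X}|}}^{A}
\end{bmatrix} \tag{4.279} \\
&= \bigoplus_{x \in \mathcal{X}} p_X(x) \rho_x^{A}. \tag{4.280}
\end{align}

It is possible to specify a matrix representation for each Kraus operator $A_j$ in terms of $|\mathcal{X}|$ block matrices:

$$A_j = \begin{bmatrix} A_{j,1} & A_{j,2} & \cdots & A_{j,|\mathcal{X}|} \end{bmatrix}. \tag{4.281}$$

Each operator $A_j \left( \sum_x p_X(x) \, |x\rangle\langle x|^{X} \otimes \rho_x^{A} \right) A_j^\dagger$ in the sum in (4.277) then takes the following form:

\begin{align}
& \begin{bmatrix} A_{j,1} & A_{j,2} & \cdots & A_{j,|\mathcal{X}|} \end{bmatrix}
\begin{bmatrix}
p_X(x_1) \rho_{x_1}^{A} & & \\
& \ddots & \\
& & p_X(x_{|\mathcal{X}|}) \rho_{x_{|\mathcal{X}|}}^{A}
\end{bmatrix}
\begin{bmatrix} A_{j,1}^\dagger \\ \vdots \\ A_{j,|\mathcal{X}|}^\dagger \end{bmatrix} \tag{4.282} \\
&= \begin{bmatrix} A_{j,1} & A_{j,2} & \cdots & A_{j,|\mathcal{X}|} \end{bmatrix}
\begin{bmatrix} p_X(x_1) \rho_{x_1}^{A} A_{j,1}^\dagger \\ \vdots \\ p_X(x_{|\mathcal{X}|}) \rho_{x_{|\mathcal{X}|}}^{A} A_{j,|\mathcal{X}|}^\dagger \end{bmatrix} \tag{4.283} \\
&= \sum_{x \in \mathcal{X}} p_X(x) \, A_{j,x} \rho_x^{A} A_{j,x}^\dagger. \tag{4.284}
\end{align}

We can write the overall map as follows:

\begin{align}
\mathcal{N}^{XA \to B}(\sigma^{XA}) &= \sum_j \sum_{x \in \mathcal{X}} p_X(x) \, A_{j,x} \rho_x^{A} A_{j,x}^\dagger \tag{4.285} \\
&= \sum_{x \in \mathcal{X}} p_X(x) \sum_j A_{j,x} \rho_x^{A} A_{j,x}^\dagger \tag{4.286} \\
&= \sum_{x \in \mathcal{X}} p_X(x) \, \mathcal{N}_x^{A \to B}(\rho_x^{A}), \tag{4.287}
\end{align}

where we define each map $\mathcal{N}_x^{A \to B}$ as follows:

$$\mathcal{N}_x^{A \to B}(\rho_x^{A}) \equiv \sum_j A_{j,x} \rho_x^{A} A_{j,x}^\dagger. \tag{4.288}$$

Thus, the action of any quantum channel on a classical-quantum state is the same as the action of the conditional quantum encoder.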
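
The block decomposition above can be tested on random data. The following sketch assumes NumPy, qubit systems $X$, $A$, and $B$, four Kraus operators obtained by slicing a random isometry, and the Kronecker ordering $X \otimes A$ (so that the block $A_{j,x}$ consists of the columns of $A_j$ belonging to the classical value $x$). All of these are illustrative assumptions rather than choices made in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
dX, dA, dB, n_kraus = 2, 2, 2, 4

# Rows of a random isometry, cut into n_kraus blocks of height dB, give valid
# Kraus operators A_j on XA -> B (sum_j A_j^dagger A_j = I by construction).
G = rng.normal(size=(n_kraus * dB, dX * dA)) + 1j * rng.normal(size=(n_kraus * dB, dX * dA))
Q, _ = np.linalg.qr(G)
kraus = [Q[j * dB:(j + 1) * dB, :] for j in range(n_kraus)]

def random_density(d):
    """A random density operator (positive semidefinite, unit trace)."""
    H = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = H @ H.conj().T
    return rho / np.trace(rho)

def block(A, x):
    """Block A_{j,x} of (4.281): the columns of A_j for classical value x."""
    return A[:, x * dA:(x + 1) * dA]

p_X = [0.25, 0.75]
rhos = [random_density(dA) for _ in range(dX)]

# Classical-quantum input sigma^{XA} of (4.276), block diagonal as in (4.279).
sigma = np.zeros((dX * dA, dX * dA), dtype=complex)
for x in range(dX):
    proj = np.zeros((dX, dX))
    proj[x, x] = 1.0
    sigma += p_X[x] * np.kron(proj, rhos[x])

full = sum(A @ sigma @ A.conj().T for A in kraus)  # left side of (4.285)
conditional = sum(                                  # right side of (4.287)
    p_X[x] * sum(block(A, x) @ rhos[x] @ block(A, x).conj().T for A in kraus)
    for x in range(dX)
)
print(np.allclose(full, conditional))
```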

Exercise 4.4.15 Show that the condition $\sum_j A_j^\dagger A_j = I$ implies the $|\mathcal{X}|$ conditions:

$$\forall x \in \mathcal{X}: \quad \sum_j A_{j,x}^\dagger A_{j,x} = I. \tag{4.289}$$

4.5 Summary 

We give a brief summary of the main results in this chapter. We derived all of these results from the noiseless quantum theory and an ensemble viewpoint. An alternate viewpoint is to say that the density operator is the state of the system and then give the postulates of quantum mechanics in terms of the density operator. Regardless of which viewpoint you view as more fundamental, they are consistent with each other in standard quantum mechanics.

The density operator $\rho$ for an ensemble $\{p_X(x), |\psi_x\rangle\}$ is the following expectation:

$$\rho \equiv \sum_x p_X(x) \, |\psi_x\rangle\langle\psi_x|. \tag{4.290}$$

The evolution of the density operator according to a unitary operator $U$ is

$$\rho \to U \rho U^\dagger. \tag{4.291}$$

A measurement of the state according to a measurement $\{M_j\}$ where $\sum_j M_j^\dagger M_j = I$ leads to the following post-measurement state:

$$\rho \to \frac{M_j \rho M_j^\dagger}{p_J(j)}, \tag{4.292}$$

where the probability $p_J(j)$ for obtaining outcome $j$ is

$$p_J(j) = \text{Tr}\{M_j^\dagger M_j \rho\}. \tag{4.293}$$

The most general noisy evolution that a quantum state can undergo is according to a completely-positive, trace-preserving map $\mathcal{N}(\rho)$ that we can write as follows:

$$\mathcal{N}(\rho) = \sum_j A_j \rho A_j^\dagger, \tag{4.294}$$

where $\sum_j A_j^\dagger A_j = I$. A special case of this evolution is a quantum instrument. A quantum instrument has a quantum input and a classical and quantum output. The most general way to represent a quantum instrument is as follows:

$$\rho \to \sum_j \mathcal{E}_j(\rho) \otimes |j\rangle\langle j|^{J}, \tag{4.295}$$

where each map $\mathcal{E}_j$ is a completely-positive, trace-reducing map, where

$$\mathcal{E}_j(\rho) = \sum_k A_{j,k} \rho A_{j,k}^\dagger, \tag{4.296}$$

and $\sum_{j,k} A_{j,k}^\dagger A_{j,k} = I$, so that the overall map is trace-preserving.

4.6 History and Further Reading 

The book of Nielsen and Chuang gives an excellent introduction to noisy quantum channels [197]. Horodecki, Shor, and Ruskai introduced entanglement-breaking channels and proved several properties of them (e.g., the proof of Theorem 4.4.1) [150]. Davies and Lewis introduced the quantum instrument formalism [65], and Ozawa further elaborated it [201]. Grassl et al. introduced the quantum erasure channel and constructed some simple quantum error-correcting codes for it [114]. A discussion of the conditional quantum channel appears in Yard's thesis [266].

The Purified Quantum Theory



The final chapter of our development of the quantum theory gives perhaps the most pow- 
erful viewpoint, by providing a mathematical tool, the purification theorem, which offers a 
completely different way of thinking about noise in quantum systems. This theorem shows 
that our lack of information about a set of quantum states can arise from entanglement with 
another system to which we do not have access. The system to which we do not have access 
is known as a purification. In this purified view of the quantum theory, noisy evolution 
arises from the interaction of a quantum system with its environment. The interaction of 
a quantum system with its environment leads to correlations between the quantum system 
and its environment, and this interaction leads to a loss of information because we cannot 
access the environment. The environment is thus the purification of the output of the noisy 
quantum channel. 

In Chapter 3, we introduced the noiseless quantum theory. The noiseless quantum theory is a useful theory to learn so that we can begin to grasp an intuition for some uniquely quantum behavior, but it is an idealized model of quantum information processing. In Chapter 4, we introduced the noisy quantum theory as a generalization of the noiseless quantum theory. The noisy quantum theory can describe the behavior of imperfect quantum systems that are subject to noise.

In this chapter, we actually show that we can view the noisy quantum theory as a special case of the noiseless quantum theory. This relation may seem bizarre at first, but the purification theorem allows us to make this connection. The quantum theory that we present in this chapter is a noiseless quantum theory, but we name it the purified quantum theory, in order to distinguish it from the description of the noiseless quantum theory in Chapter 3.

The purified quantum theory shows that it is possible to view noise as resulting from entanglement of a system with another system. We have actually seen a glimpse of this phenomenon in the previous chapter when we introduced the notion of the local density operator, but we did not highlight it in detail there. The example was the maximally entangled Bell state $|\Phi^+\rangle^{AB}$. This state is a pure state on the two systems $A$ and $B$, but the local density operator of Alice is the maximally mixed state $\pi^{A}$. We saw that the local density operator is a mathematical object that allows us to make all the predictions about any local measurement or evolution. We also have seen that a density operator arises from an ensemble, but there is also the reverse interpretation, that an ensemble corresponds to the spectral decomposition of any density operator. There is a sense in which we can view this local density operator as arising from an ensemble where we choose the states $|0\rangle$ and $|1\rangle$ with equal probability 1/2. The purification idea goes as far as to say that the noisy ensemble for Alice with density operator $\pi^{A}$ arises from the entanglement of her system with Bob's. We explore this idea in more detail in this final chapter on the quantum theory.

Figure 5.1: The above diagram denotes the purification $|\psi\rangle^{RA}$ of a density matrix $\rho^{A}$. The above diagram indicates that the reference system $R$ is entangled with the system $A$. The purification theorem states that the noise inherent in a density matrix $\rho^{A}$ is due to entanglement with a reference system $R$. We typically use the color purple throughout to indicate a system that is not accessible to the parties involved in a protocol.



5.1 Purification 



Suppose we are given a density operator $\rho^{A}$ on a system $A$ and suppose that its spectral decomposition is as follows:

$$\rho^{A} = \sum_x p_X(x) \, |x\rangle\langle x|^{A}. \tag{5.1}$$

We can associate the ensemble $\{p_X(x), |x\rangle\}$ to this density operator according to its spectral decomposition.

Definition 5.1.1 (Purification). A purification of $\rho^{A}$ is a pure bipartite state $|\psi\rangle^{RA}$ on a reference system $R$ and the original system $A$. The purification state $|\psi\rangle^{RA}$ has the property that the reduced state on system $A$ is equal to $\rho^{A}$ in (5.1):

$$\rho^{A} = \text{Tr}_R\{|\psi\rangle\langle\psi|^{RA}\}. \tag{5.2}$$

Any density operator $\rho^{A}$ has a purification $|\psi\rangle^{RA}$. We claim that the following state is a purification of $\rho^{A}$:

$$|\psi\rangle^{RA} \equiv \sum_x \sqrt{p_X(x)} \, |x\rangle^{R} |x\rangle^{A}, \tag{5.3}$$

where the set $\{|x\rangle^{R}\}_x$ of vectors is some set of orthonormal vectors for the reference system $R$. The next exercise asks you to verify this claim.

Exercise 5.1.1 Show that the state $|\psi\rangle^{RA}$, as defined in (5.3), is a purification of the density operator $\rho^{A}$.

The purification idea has an interesting physical implication: it implies that we can think of our lack of knowledge about a particular quantum system as being due to entanglement with some external reference system to which we do not have access. That is, we can think that the density operator $\rho^{A}$ with corresponding ensemble $\{p_X(x), |x\rangle\}$ arises from the entanglement of the system $A$ with the reference system $R$ and from our lack of access to the system $R$.
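
The construction in (5.3) translates directly into a numerical recipe: diagonalize the density operator, attach an orthonormal reference vector to each eigenvector, and weight by the square roots of the eigenvalues. The sketch below assumes NumPy, takes the reference system to have the same dimension as $A$ with the standard basis as its orthonormal set, and uses hypothetical function names. It is a spot check for one state, not a substitute for the exercise.

```python
import numpy as np

def purify(rho):
    """Build |psi>^{RA} = sum_x sqrt(p_X(x)) |x>^R |x>^A as in (5.3)."""
    p, V = np.linalg.eigh(rho)   # columns of V are the eigenvectors |x>^A
    p = np.clip(p, 0.0, None)    # guard against tiny negative round-off
    d = rho.shape[0]
    psi = np.zeros(d * d, dtype=complex)
    for x in range(d):
        ref = np.zeros(d)
        ref[x] = 1.0             # assumed reference basis: standard basis of R
        psi += np.sqrt(p[x]) * np.kron(ref, V[:, x])
    return psi

def reduced_state_on_A(psi, d_ref, d_sys):
    """Trace out R from |psi><psi|^{RA} (R is the first tensor factor)."""
    M = psi.reshape(d_ref, d_sys)
    return M.T @ M.conj()

rho = np.array([[0.6, 0.2 - 0.1j], [0.2 + 0.1j, 0.4]])
psi = purify(rho)
recovered = reduced_state_on_A(psi, 2, 2)
print(np.allclose(recovered, rho))
```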

Stated another way, the purification idea gives us a fundamentally different way to in- 
terpret noise. The interpretation is that any noise on a local system is due to entanglement 
with another system to which we do not have access. This interpretation extends to the 
noise from a noisy quantum channel. We can view this noise as arising from the interaction 
of the system that we possess with an external environment over which we have no control. 

The global state is a pure state, but a reduced state is not a pure state in general because 
we trace over the reference system to obtain it. A reduced state is pure if and only if the 
global state is a pure product state. 

Exercise 5.1.2 Show that all purifications are related by a unitary operator on the reference 
system. 

Exercise 5.1.3 Find a purification of the following classical-quantum state:

$$\sum_x p_X(x) \, |x\rangle\langle x|^{X} \otimes \rho_x^{A}. \tag{5.4}$$

Exercise 5.1.4 Let $\{p_X(x), \rho_x^{A}\}$ be an ensemble of density operators. Suppose that $|\psi_x\rangle^{RA}$ is a purification of $\rho_x^{A}$. The expected density operator of the ensemble is

$$\rho^{A} \equiv \sum_x p_X(x) \, \rho_x^{A}. \tag{5.5}$$

Find a purification of $\rho^{A}$.

5.1.1 Extension of a Quantum System 

We can also define an extension of a quantum system $\rho^{A}$. It is some noisy quantum system $\Omega^{RA}$ such that

$$\rho^{A} = \text{Tr}_R\{\Omega^{RA}\}. \tag{5.6}$$

This definition is useful sometimes, but we can always find a purification of the extended state.



5.2 Isometric Evolution 



A noisy quantum channel admits a purification as well. We motivate this idea with a simple 
example. 
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5.2.1 Isometric Extension of the Bit-Flip Channel 



Consider the bit-flip channel from (4.159): it applies the identity operator with some probability
$1 - p$ and applies the bit-flip Pauli operator X with probability p. Suppose that we input a qubit
system A in the state $|\psi\rangle$ to this channel. The ensemble corresponding to the state at the
output has the following form:

$\{\{1 - p, |\psi\rangle\}, \{p, X|\psi\rangle\}\},$ (5.7)

and the density operator of the resulting state is

$(1 - p)|\psi\rangle\langle\psi| + p X|\psi\rangle\langle\psi|X.$ (5.8)



The following state is a purification of the above density operator (you should quickly check 
that this relation holds): 



$\sqrt{1-p}\,|\psi\rangle^A |0\rangle^E + \sqrt{p}\,X|\psi\rangle^A |1\rangle^E.$ (5.9)



We label the original system as A and label the purification system as E. In this context, 
we can view the purification system as the environment of the channel. 

There is another way for interpreting the dynamics of the above bit-flip channel. Instead 
of determining the ensemble for the channel and then purifying, we can say that the channel 
directly implements the following map from the system A to the larger joint system AE: 



$|\psi\rangle^A \to \sqrt{1-p}\,|\psi\rangle^A |0\rangle^E + \sqrt{p}\,X|\psi\rangle^A |1\rangle^E.$ (5.10)



We see that any positive p, i.e., any amount of noise in the channel, can lead to entanglement 
of the input system with the environment E. We then obtain the noisy dynamics of the 
channel by discarding (tracing out) the environment system E. 
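The following is a minimal numerical sketch of the statement above, assuming the isometric map in (5.10) and the ordering $|a\rangle^A|e\rangle^E$ = 00, 01, 10, 11 for the rows of its matrix. The function names are illustrative choices, not notation from the text. It builds the isometry, applies it to a density operator, and traces out E so that the result can be compared with $(1-p)\rho + pX\rho X$.

```python
import math

def bit_flip_isometry(p):
    """4x2 matrix of the map in (5.10); rows are ordered |a>|e> = 00, 01, 10, 11."""
    s0, s1 = math.sqrt(1 - p), math.sqrt(p)
    # U|0> = s0|0>|0> + s1|1>|1>,   U|1> = s0|1>|0> + s1|0>|1>
    return [[s0, 0.0],
            [0.0, s1],
            [0.0, s0],
            [s1, 0.0]]

def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def dagger(a):
    """Conjugate transpose."""
    return [[complex(a[i][j]).conjugate() for i in range(len(a))]
            for j in range(len(a[0]))]

def trace_out_env(m):
    """Partial trace over the second (environment) qubit of a 4x4 matrix on AE."""
    return [[sum(m[2 * a + e][2 * b + e] for e in range(2))
             for b in range(2)] for a in range(2)]

def bit_flip_via_isometry(rho, p):
    """Apply the isometry, then discard the environment."""
    u = bit_flip_isometry(p)
    return trace_out_env(matmul(matmul(u, rho), dagger(u)))

def bit_flip_direct(rho, p):
    """The channel written directly as (1 - p) rho + p X rho X."""
    x = [[0, 1], [1, 0]]
    xrx = matmul(matmul(x, rho), x)
    return [[(1 - p) * rho[i][j] + p * xrx[i][j] for j in range(2)]
            for i in range(2)]
```

Calling `bit_flip_via_isometry(rho, p)` and `bit_flip_direct(rho, p)` on the same input should agree up to floating-point error, which is one way to see that discarding E reproduces the noisy channel.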



Exercise 5.2.1 Find two input states for which the map in (5.10) does not lead to entan- 
glement between systems A and E. 

The above map is an isometric extension of the bit-flip channel. Let us label it as
$U^{A \to AE}$, where the notation indicates that the input system is A and the output system is
AE. An isometry is similar to a unitary operator but different because it maps states on one
input system to states on a joint system. It does not admit a square matrix representation,
but instead admits a rectangular matrix representation. The matrix representation of this
isometric operation consists of the following matrix elements:

$\begin{bmatrix}
\langle 0|^A \langle 0|^E U^{A \to AE} |0\rangle^A & \langle 0|^A \langle 0|^E U^{A \to AE} |1\rangle^A \\
\langle 0|^A \langle 1|^E U^{A \to AE} |0\rangle^A & \langle 0|^A \langle 1|^E U^{A \to AE} |1\rangle^A \\
\langle 1|^A \langle 0|^E U^{A \to AE} |0\rangle^A & \langle 1|^A \langle 0|^E U^{A \to AE} |1\rangle^A \\
\langle 1|^A \langle 1|^E U^{A \to AE} |0\rangle^A & \langle 1|^A \langle 1|^E U^{A \to AE} |1\rangle^A
\end{bmatrix}
=
\begin{bmatrix}
\sqrt{1-p} & 0 \\
0 & \sqrt{p} \\
0 & \sqrt{1-p} \\
\sqrt{p} & 0
\end{bmatrix}.$ (5.11)



There is no reason that we have to choose the environment states as we did in (5.10). We 
could have chosen the environment states to be any orthonormal basis — isometric behavior 
only requires that the states on the environment be distinguishable. This is related to the
fact that all purifications are related by a unitary on the purifying system (see Exercise 5.1.2).
An Isometry is Part of a Unitary on a Larger System



We can view the dynamics in (5.10) as an interaction between an initially pure environment
and the qubit state $|\psi\rangle$. So, an equivalent way to implement the isometric mapping is
with a two-step procedure. We first assume that the environment of the channel is in a pure
state $|0\rangle^E$ before the interaction begins. The joint state of the qubit $|\psi\rangle$ and the environment
is

$|\psi\rangle^A |0\rangle^E.$ (5.12)

These two systems then interact according to a unitary operator $V^{AE}$. We can specify two columns
of the unitary operator (we make this more clear in a bit) by means of the isometric mapping
in (5.10):

$V^{AE} |\psi\rangle^A |0\rangle^E = \sqrt{1-p}\,|\psi\rangle^A |0\rangle^E + \sqrt{p}\,X|\psi\rangle^A |1\rangle^E.$ (5.13)



In order to specify the full unitary $V^{AE}$, we must also specify how it behaves when the initial
state of the qubit and the environment is

$|\psi\rangle^A |1\rangle^E.$ (5.14)

We choose the mapping to be as follows so that the overall interaction is unitary:

$V^{AE} |\psi\rangle^A |1\rangle^E = \sqrt{p}\,|\psi\rangle^A |0\rangle^E - \sqrt{1-p}\,X|\psi\rangle^A |1\rangle^E.$ (5.15)



Exercise 5.2.2 Check that the operator $V^{AE}$ is unitary by determining its action on the
computational basis $\{|0\rangle^A|0\rangle^E, |0\rangle^A|1\rangle^E, |1\rangle^A|0\rangle^E, |1\rangle^A|1\rangle^E\}$ and showing that the
outputs for these inputs form an orthonormal basis.



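Exercise 5.2.3 Verify that the matrix representation of the full unitary operator V is

$\begin{bmatrix}
\sqrt{1-p} & \sqrt{p} & 0 & 0 \\
0 & 0 & \sqrt{p} & -\sqrt{1-p} \\
0 & 0 & \sqrt{1-p} & \sqrt{p} \\
\sqrt{p} & -\sqrt{1-p} & 0 & 0
\end{bmatrix}$ (5.16)

by considering the matrix elements $\langle i|^A \langle j|^E V |k\rangle^A |l\rangle^E$.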

The Complementary Channel 

We may not only be interested in the receiver's output of the quantum channel. We may also 
be interested in determining the environment's output from the channel. This idea becomes 
increasingly important as we proceed in our study of quantum Shannon theory. We should 
consider all parties in a quantum protocol, and the purified quantum theory allows us to do 
so. We consider the environment as one of the parties in a quantum protocol because the 
environment could also be receiving some quantum information from the sender. 

We can obtain the environment's output from the quantum channel simply by tracing 
out every system besides the environment. The map from the sender to the environment is 



©2012 Mark M. Wilde — This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 
Unported License 






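known as a complementary channel. In our example of the isometric extension of the bit-flip
channel in (5.10), we can check that the environment receives the following density operator:

$\mathrm{Tr}_A\{(\sqrt{1-p}\,|\psi\rangle^A|0\rangle^E + \sqrt{p}\,X|\psi\rangle^A|1\rangle^E)(\sqrt{1-p}\,\langle\psi|^A\langle 0|^E + \sqrt{p}\,\langle\psi|^A X \langle 1|^E)\}$
$= \mathrm{Tr}_A\{(1-p)|\psi\rangle\langle\psi|^A \otimes |0\rangle\langle 0|^E + \sqrt{p(1-p)}\,X|\psi\rangle\langle\psi|^A \otimes |1\rangle\langle 0|^E\}$
$\quad + \mathrm{Tr}_A\{\sqrt{p(1-p)}\,|\psi\rangle\langle\psi|^A X \otimes |0\rangle\langle 1|^E + p X|\psi\rangle\langle\psi|^A X \otimes |1\rangle\langle 1|^E\}$ (5.17)
$= (1-p)|0\rangle\langle 0|^E + \sqrt{p(1-p)}\,\langle\psi|X|\psi\rangle\,|1\rangle\langle 0|^E$
$\quad + \sqrt{p(1-p)}\,\langle\psi|X|\psi\rangle\,|0\rangle\langle 1|^E + p|1\rangle\langle 1|^E$ (5.18)
$= (1-p)|0\rangle\langle 0|^E + \sqrt{p(1-p)}\,\langle\psi|X|\psi\rangle\,(|1\rangle\langle 0|^E + |0\rangle\langle 1|^E) + p|1\rangle\langle 1|^E$ (5.19)
$= (1-p)|0\rangle\langle 0|^E + \sqrt{p(1-p)}\,2\mathrm{Re}\{\alpha^*\beta\}\,(|1\rangle\langle 0|^E + |0\rangle\langle 1|^E) + p|1\rangle\langle 1|^E,$ (5.20)

where in the last line we assume that the qubit $|\psi\rangle \equiv \alpha|0\rangle + \beta|1\rangle$.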

It is helpful to examine several cases of the above example. Consider the case where
the noise parameter $p = 0$ or $p = 1$. In this case, the environment receives one of the
respective states $|0\rangle$ or $|1\rangle$. Therefore, in these cases, the environment does not receive any
of the quantum information about the state $|\psi\rangle$ transmitted down the channel: it does not
learn anything about the probability amplitudes $\alpha$ or $\beta$. This viewpoint is a completely
different way to see that the channel is truly noiseless in these cases. A channel is noiseless
if the environment of the channel does not learn anything about the states that we transmit
through it, i.e., the channel does not leak quantum information to the environment. Now let
us consider the case where $0 < p < 1$. As p approaches 1/2 from either above or below, the
amplitude $\sqrt{p(1-p)}$ of the off-diagonal terms increases monotonically and reaches its peak
at 1/2. Thus, at the peak 1/2, the off-diagonal terms are the strongest, implying that the
environment is stealing much of the coherence from the original quantum state $|\psi\rangle$.
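As a small numerical companion to the cases just discussed, the sketch below computes the environment's density operator directly from the amplitudes of the state in (5.10), assuming the input qubit is $\alpha|0\rangle + \beta|1\rangle$ and the index convention 2a + e for the joint basis $|a\rangle^A|e\rangle^E$. The function name is a hypothetical choice for illustration.

```python
import math

def environment_state(alpha, beta, p):
    """2x2 density matrix on E after the map (5.10) acts on alpha|0> + beta|1>."""
    s0, s1 = math.sqrt(1 - p), math.sqrt(p)
    # Amplitudes of sqrt(1-p)|psi>|0> + sqrt(p) X|psi>|1>, stored at index 2*a + e.
    out = [s0 * alpha,   # a=0, e=0
           s1 * beta,    # a=0, e=1 (X|psi> = beta|0> + alpha|1>)
           s0 * beta,    # a=1, e=0
           s1 * alpha]   # a=1, e=1
    # Trace out A: rho_E[e][f] = sum_a out[a,e] * conj(out[a,f]).
    return [[sum(out[2 * a + e] * complex(out[2 * a + f]).conjugate()
                 for a in range(2))
             for f in range(2)] for e in range(2)]
```

For example, `environment_state(0.6, 0.8, 0.5)` has off-diagonal entries $\sqrt{p(1-p)}\,2\mathrm{Re}\{\alpha^*\beta\} = 0.48$, while setting p to 0 or 1 leaves the environment in a fixed state that carries no dependence on $\alpha$ or $\beta$.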

Exercise 5.2.4 Show that the receiver's output density operator for a bit-flip channel with 
p = 1/2 is the same as what the environment obtains. 

5.2.2 Isometric Extension of a General Noisy Quantum Channel 

We now discuss the general definition of an isometric extension of a quantum channel.
Let $\mathcal{N}^{A \to B}$ denote a noisy quantum channel, where the notation indicates that its input
is some quantum system A and its output is some quantum system B. The isometric extension
or Stinespring dilation of a quantum channel is the purification of that channel.
The isometric extension $U_{\mathcal{N}}^{A \to BE}$ of quantum channel $\mathcal{N}^{A \to B}$ is a particular map related to
$\mathcal{N}^{A \to B}$. The input to the isometry $U_{\mathcal{N}}^{A \to BE}$ is the original input system A, and the output of
the isometry is the channel output B and an environment system E (the environment system
is analogous to the purification system from Section 5.1). The notation $U_{\mathcal{N}}^{A \to BE}$ indicates
the input and output systems, and Figure 5.2 depicts a quantum circuit for the isometric
extension. An isometry possesses the following two properties:
Figure 5.2: The above figure depicts the isometric extension $U_{\mathcal{N}}^{A \to BE}$ of a quantum channel $\mathcal{N}^{A \to B}$. The
extension $U_{\mathcal{N}}^{A \to BE}$ includes the inaccessible environment on system E as a "receiver" of quantum information.
Ignoring the environment E gives the noisy channel $\mathcal{N}^{A \to B}$.



1. It produces the evolution of the noisy quantum channel $\mathcal{N}^{A \to B}$ if we trace out the
environment system:

$\mathrm{Tr}_E\{U_{\mathcal{N}}^{A \to BE}(\rho)\} = \mathcal{N}^{A \to B}(\rho),$ (5.21)

where $\rho$ is any density operator input to the channel $\mathcal{N}^{A \to B}$.

2. It behaves as an isometry: it is analogous to a rectangular matrix that behaves somewhat
like a unitary operator. The matrix representation of an isometry is a rectangular
matrix formed from selecting only a few of the columns from a unitary matrix. An
isometry obeys the following two properties:

$U_{\mathcal{N}}^\dagger U_{\mathcal{N}} = I^A,$ (5.22)
$U_{\mathcal{N}} U_{\mathcal{N}}^\dagger = \Pi^{BE},$ (5.23)

where $\Pi^{BE}$ is some projector on the joint system BE. The first property indicates
that the isometry behaves analogously to a unitary operator, because we can determine
an inverse operation simply by taking its conjugate transpose. The name "isometry"
derives from the first property because it implies that the mapping preserves the lengths
of vectors. The second property distinguishes an isometric operation from a unitary
one. It states that the isometry takes states in the input system A to a particular
subspace of the joint system BE. The projector $\Pi^{BE}$ projects onto the subspace
where the isometry takes input quantum states.

Isometric Extension from the Kraus Operators 

It is possible to determine the isometric extension of a quantum channel directly from its
Kraus operators. Consider a noisy quantum channel $\mathcal{N}^{A \to B}$ with the following Kraus
representation:

$\mathcal{N}^{A \to B}(\rho^A) = \sum_j N_j \rho^A N_j^\dagger.$ (5.24)
The isometric extension of the channel $\mathcal{N}^{A \to B}$ is the following map:

$U_{\mathcal{N}}^{A \to BE} \equiv \sum_j N_j \otimes |j\rangle^E.$ (5.25)

It is straightforward to verify that the above map is an isometry. We first need to verify that
$U_{\mathcal{N}}^\dagger U_{\mathcal{N}}$ is equal to the identity on the system A:

$U_{\mathcal{N}}^\dagger U_{\mathcal{N}} = \left(\sum_k N_k^\dagger \otimes \langle k|^E\right)\left(\sum_j N_j \otimes |j\rangle^E\right)$ (5.26)
$= \sum_{k,j} N_k^\dagger N_j \langle k|j\rangle$ (5.27)
$= \sum_k N_k^\dagger N_k$ (5.28)
$= I.$ (5.29)

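The last equality follows from the completeness condition of the Kraus operators. We next
need to prove that $U_{\mathcal{N}} U_{\mathcal{N}}^\dagger$ is a projector on the joint system BE. This follows simply by
noting that

$U_{\mathcal{N}} U_{\mathcal{N}}^\dagger U_{\mathcal{N}} U_{\mathcal{N}}^\dagger = U_{\mathcal{N}} (U_{\mathcal{N}}^\dagger U_{\mathcal{N}}) U_{\mathcal{N}}^\dagger = U_{\mathcal{N}} I^A U_{\mathcal{N}}^\dagger = U_{\mathcal{N}} U_{\mathcal{N}}^\dagger.$ (5.30)

We can also prove this a bit more explicitly. First, consider that

$U_{\mathcal{N}} U_{\mathcal{N}}^\dagger = \left(\sum_j N_j \otimes |j\rangle^E\right)\left(\sum_k N_k^\dagger \otimes \langle k|^E\right)$ (5.31)
$= \sum_{j,k} N_j N_k^\dagger \otimes |j\rangle\langle k|^E.$ (5.32)

Let us now verify that the above operator is a projector by squaring it:

$(U_{\mathcal{N}} U_{\mathcal{N}}^\dagger)^2 = \left(\sum_{j,k} N_j N_k^\dagger \otimes |j\rangle\langle k|^E\right)\left(\sum_{j',k'} N_{j'} N_{k'}^\dagger \otimes |j'\rangle\langle k'|^E\right)$ (5.33)
$= \sum_{j,k,j',k'} N_j N_k^\dagger N_{j'} N_{k'}^\dagger \otimes |j\rangle\langle k|j'\rangle\langle k'|^E$ (5.34)
$= \sum_{j,k,k'} N_j N_k^\dagger N_k N_{k'}^\dagger \otimes |j\rangle\langle k'|^E$ (5.35)
$= \sum_{j,k'} N_j \left(\sum_k N_k^\dagger N_k\right) N_{k'}^\dagger \otimes |j\rangle\langle k'|^E$ (5.36)
$= \sum_{j,k'} N_j N_{k'}^\dagger \otimes |j\rangle\langle k'|^E$ (5.37)
$= U_{\mathcal{N}} U_{\mathcal{N}}^\dagger.$ (5.38)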
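Finally, we should verify that $U_{\mathcal{N}}$ is an extension of $\mathcal{N}$. The application of the isometry to
an arbitrary density operator $\rho^A$ gives the following map:

$U_{\mathcal{N}}^{A \to BE}(\rho^A) \equiv U_{\mathcal{N}} \rho^A U_{\mathcal{N}}^\dagger$ (5.39)
$= \left(\sum_j N_j \otimes |j\rangle^E\right) \rho^A \left(\sum_k N_k^\dagger \otimes \langle k|^E\right)$ (5.40)
$= \sum_{j,k} N_j \rho^A N_k^\dagger \otimes |j\rangle\langle k|^E,$ (5.41)

and tracing out the environment system gives back the original noisy channel $\mathcal{N}^{A \to B}$:

$\mathrm{Tr}_E\{U_{\mathcal{N}}^{A \to BE}(\rho^A)\} = \sum_j N_j \rho^A N_j^\dagger$ (5.42)
$= \mathcal{N}^{A \to B}(\rho^A).$ (5.43)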
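The construction in (5.25) can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the standard amplitude damping Kraus operators $\sqrt{\gamma}|0\rangle\langle 1|$ and $|0\rangle\langle 0| + \sqrt{1-\gamma}|1\rangle\langle 1|$ as a test case (they are not restated in this section), orders the rows of the isometry as Bob's index followed by Eve's index, and uses helper names chosen only for this example.

```python
import math

def isometry_from_kraus(kraus):
    """Stack Kraus operators N_j (each d_B x d_A) into U = sum_j N_j (x) |j>^E.

    The row index of U is b * len(kraus) + j: Bob's index b, then Eve's index j.
    """
    d_b, d_a, n_e = len(kraus[0]), len(kraus[0][0]), len(kraus)
    return [[kraus[j][b][a] for a in range(d_a)]
            for b in range(d_b) for j in range(n_e)]

def gram(u):
    """U^dagger U, which should be the identity on A if U is an isometry."""
    d_a = len(u[0])
    return [[sum(complex(row[a]).conjugate() * row[c] for row in u)
             for c in range(d_a)] for a in range(d_a)]

def outputs(u, rho, n_e):
    """Return (Bob's state, Eve's state) from U rho U^dagger via partial traces."""
    d_a, d_b = len(rho), len(u) // n_e

    def elem(r, s):
        # Entry (r, s) of U rho U^dagger.
        return sum(u[r][a] * rho[a][c] * complex(u[s][c]).conjugate()
                   for a in range(d_a) for c in range(d_a))

    bob = [[sum(elem(b * n_e + j, c * n_e + j) for j in range(n_e))
            for c in range(d_b)] for b in range(d_b)]
    eve = [[sum(elem(b * n_e + i, b * n_e + j) for b in range(d_b))
            for j in range(n_e)] for i in range(n_e)]
    return bob, eve

# Assumed test case: amplitude damping with damping parameter gamma.
gamma = 0.25
kraus = [[[0.0, math.sqrt(gamma)], [0.0, 0.0]],       # sqrt(gamma)|0><1|
         [[1.0, 0.0], [0.0, math.sqrt(1 - gamma)]]]   # |0><0| + sqrt(1-gamma)|1><1|
u = isometry_from_kraus(kraus)
```

Here `gram(u)` should be the 2x2 identity, Bob's part of `outputs(u, rho, 2)` should match $\sum_j N_j \rho N_j^\dagger$, and Eve's part gives the state that the environment holds.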

Exercise 5.2.5 Show that all isometric extensions of a noisy quantum channel are equivalent
up to an isometry on the environment system (this is similar to the result of Exercise 5.1.2).

Exercise 5.2.6 Show that the isometric extension of the erasure channel is

$U_{\mathcal{N}}^{A \to BE} = \sqrt{1-\varepsilon}\,(|0\rangle^B\langle 0|^A + |1\rangle^B\langle 1|^A) \otimes |e\rangle^E + \sqrt{\varepsilon}\,|e\rangle^B\langle 0|^A \otimes |0\rangle^E + \sqrt{\varepsilon}\,|e\rangle^B\langle 1|^A \otimes |1\rangle^E.$ (5.44)



Exercise 5.2.7 Determine the resulting state when Alice inputs an arbitrary pure state $|\psi\rangle$
into the erasure channel. Verify that Bob and Eve receive the same ensemble (they have the
same local density operator) when the erasure probability $\varepsilon = 1/2$.

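Exercise 5.2.8 Show that the matrix representation of the isometric extension of the erasure
channel is

$\begin{bmatrix}
\langle 0|^B \langle 0|^E U_{\mathcal{N}}^{A \to BE} |0\rangle^A & \langle 0|^B \langle 0|^E U_{\mathcal{N}}^{A \to BE} |1\rangle^A \\
\langle 0|^B \langle 1|^E U_{\mathcal{N}}^{A \to BE} |0\rangle^A & \langle 0|^B \langle 1|^E U_{\mathcal{N}}^{A \to BE} |1\rangle^A \\
\langle 0|^B \langle e|^E U_{\mathcal{N}}^{A \to BE} |0\rangle^A & \langle 0|^B \langle e|^E U_{\mathcal{N}}^{A \to BE} |1\rangle^A \\
\langle 1|^B \langle 0|^E U_{\mathcal{N}}^{A \to BE} |0\rangle^A & \langle 1|^B \langle 0|^E U_{\mathcal{N}}^{A \to BE} |1\rangle^A \\
\langle 1|^B \langle 1|^E U_{\mathcal{N}}^{A \to BE} |0\rangle^A & \langle 1|^B \langle 1|^E U_{\mathcal{N}}^{A \to BE} |1\rangle^A \\
\langle 1|^B \langle e|^E U_{\mathcal{N}}^{A \to BE} |0\rangle^A & \langle 1|^B \langle e|^E U_{\mathcal{N}}^{A \to BE} |1\rangle^A \\
\langle e|^B \langle 0|^E U_{\mathcal{N}}^{A \to BE} |0\rangle^A & \langle e|^B \langle 0|^E U_{\mathcal{N}}^{A \to BE} |1\rangle^A \\
\langle e|^B \langle 1|^E U_{\mathcal{N}}^{A \to BE} |0\rangle^A & \langle e|^B \langle 1|^E U_{\mathcal{N}}^{A \to BE} |1\rangle^A \\
\langle e|^B \langle e|^E U_{\mathcal{N}}^{A \to BE} |0\rangle^A & \langle e|^B \langle e|^E U_{\mathcal{N}}^{A \to BE} |1\rangle^A
\end{bmatrix}
=
\begin{bmatrix}
0 & 0 \\
0 & 0 \\
\sqrt{1-\varepsilon} & 0 \\
0 & 0 \\
0 & 0 \\
0 & \sqrt{1-\varepsilon} \\
\sqrt{\varepsilon} & 0 \\
0 & \sqrt{\varepsilon} \\
0 & 0
\end{bmatrix}.$ (5.45)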



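Exercise 5.2.9 Show that the matrix representation of the isometric extension $U_{\mathcal{N}}^{A \to BE}$ of
the amplitude damping channel is

$\begin{bmatrix}
\langle 0|^B \langle 0|^E U_{\mathcal{N}}^{A \to BE} |0\rangle^A & \langle 0|^B \langle 0|^E U_{\mathcal{N}}^{A \to BE} |1\rangle^A \\
\langle 0|^B \langle 1|^E U_{\mathcal{N}}^{A \to BE} |0\rangle^A & \langle 0|^B \langle 1|^E U_{\mathcal{N}}^{A \to BE} |1\rangle^A \\
\langle 1|^B \langle 0|^E U_{\mathcal{N}}^{A \to BE} |0\rangle^A & \langle 1|^B \langle 0|^E U_{\mathcal{N}}^{A \to BE} |1\rangle^A \\
\langle 1|^B \langle 1|^E U_{\mathcal{N}}^{A \to BE} |0\rangle^A & \langle 1|^B \langle 1|^E U_{\mathcal{N}}^{A \to BE} |1\rangle^A
\end{bmatrix}
=
\begin{bmatrix}
0 & \sqrt{\gamma} \\
1 & 0 \\
0 & 0 \\
0 & \sqrt{1-\gamma}
\end{bmatrix}.$ (5.46)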
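Exercise 5.2.10 Consider a full unitary $V^{AE \to BE}$ such that

$\mathrm{Tr}_E\{V(\rho^A \otimes |0\rangle\langle 0|^E)V^\dagger\}$ (5.47)

gives the amplitude damping channel. Show that the matrix representation of V is

$\begin{bmatrix}
\langle 0|^B\langle 0|^E V |0\rangle^A|0\rangle^E & \langle 0|^B\langle 0|^E V |0\rangle^A|1\rangle^E & \langle 0|^B\langle 0|^E V |1\rangle^A|0\rangle^E & \langle 0|^B\langle 0|^E V |1\rangle^A|1\rangle^E \\
\langle 0|^B\langle 1|^E V |0\rangle^A|0\rangle^E & \langle 0|^B\langle 1|^E V |0\rangle^A|1\rangle^E & \langle 0|^B\langle 1|^E V |1\rangle^A|0\rangle^E & \langle 0|^B\langle 1|^E V |1\rangle^A|1\rangle^E \\
\langle 1|^B\langle 0|^E V |0\rangle^A|0\rangle^E & \langle 1|^B\langle 0|^E V |0\rangle^A|1\rangle^E & \langle 1|^B\langle 0|^E V |1\rangle^A|0\rangle^E & \langle 1|^B\langle 0|^E V |1\rangle^A|1\rangle^E \\
\langle 1|^B\langle 1|^E V |0\rangle^A|0\rangle^E & \langle 1|^B\langle 1|^E V |0\rangle^A|1\rangle^E & \langle 1|^B\langle 1|^E V |1\rangle^A|0\rangle^E & \langle 1|^B\langle 1|^E V |1\rangle^A|1\rangle^E
\end{bmatrix}
=
\begin{bmatrix}
0 & -\sqrt{1-\gamma} & \sqrt{\gamma} & 0 \\
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & \sqrt{\gamma} & \sqrt{1-\gamma} & 0
\end{bmatrix}.$ (5.48)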



Exercise 5.2.11 Consider the full unitary operator for the amplitude damping channel from
the previous exercise. Show that the density operator

$\mathrm{Tr}_B\{V(\rho^A \otimes |0\rangle\langle 0|^E)V^\dagger\}$ (5.49)

that Eve receives has the following matrix representation:

$\begin{bmatrix} \gamma p & \sqrt{\gamma}\,\eta^* \\ \sqrt{\gamma}\,\eta & 1 - \gamma p \end{bmatrix}$ (5.50)

if

$\rho^A = \begin{bmatrix} 1 - p & \eta \\ \eta^* & p \end{bmatrix}.$ (5.51)

By comparing with (4.231), observe that the output to Eve is the bit flip of the output of
an amplitude damping channel with damping parameter $1 - \gamma$.

Exercise 5.2.12 Consider the amplitude damping channel with parameter $\gamma$. Show that it
is possible to simulate the channel to Eve by first sending Alice's density operator through
an amplitude damping channel with parameter $1 - \gamma$ and swapping A with E, i.e., show that

$\mathrm{Tr}_B\{V_\gamma(\rho^A \otimes |0\rangle\langle 0|^E)V_\gamma^\dagger\} = \mathrm{Tr}_E\{S V_{1-\gamma}(\rho^A \otimes |0\rangle\langle 0|^E)V_{1-\gamma}^\dagger S^\dagger\},$ (5.52)

where $V_\gamma$ is the unitary operator for an amplitude damping channel with parameter $\gamma$ and
S is the swapping operator.
Complementary channel

In the purified quantum theory, it is useful to consider all parties that are participating in 
a given protocol. One such party is the environment of the channel and we call her Eve, 
because she corresponds to an eavesdropper in the cryptographic setting. 

For any quantum channel $\mathcal{N}^{A \to B}$, there exists an isometric extension $U_{\mathcal{N}}^{A \to BE}$ of that
channel. The complementary channel $(\mathcal{N}^c)^{A \to E}$ is a quantum channel from Alice to Eve. We
obtain it by tracing out Bob's system from the output of the isometry:

$(\mathcal{N}^c)^{A \to E}(\rho) \equiv \mathrm{Tr}_B\{U_{\mathcal{N}}^{A \to BE}(\rho)\}.$ (5.53)

It captures the noise that Eve "sees" by having her system coupled to Bob's system.

Exercise 5.2.13 Show that Eve's density operator is of the following form:

$\rho \to \sum_{i,j} \mathrm{Tr}\{N_i \rho N_j^\dagger\}\,|i\rangle\langle j|,$ (5.54)

if we take the isometric extension of the channel to be of the form in (5.25).



The complementary channel is unique only up to an isometry on Eve's system. It inherits 
this property from the fact that an isometric extension of a noisy channel is unique only up 
to isometries on Eve's system. For all practical purposes, this lack of uniqueness does not 
affect our study of the noise that Eve sees because the measures of noise in Chapter 11 are 
invariant under isometries on Eve's system. 

5.2.3 Generalized Dephasing Channels 

A generalized dephasing channel is one that preserves states diagonal in some preferred
orthonormal basis $\{|x\rangle\}$, but it can add arbitrary phases to the off-diagonal elements of
a density operator in this basis. The isometry of a generalized dephasing channel acts as
follows on the basis $\{|x\rangle\}$:

$U_{\mathcal{N}_D}^{A' \to BE} |x\rangle^{A'} = |x\rangle^B |\varphi_x\rangle^E,$ (5.55)

where $|\varphi_x\rangle$ is some state for the environment (these states need not be mutually orthogonal).
Thus, we can represent the isometry as follows:

$U_{\mathcal{N}_D}^{A' \to BE} \equiv \sum_x |x\rangle^B |\varphi_x\rangle^E \langle x|^{A'},$ (5.56)

and its action on a density operator $\rho$ is

$U_{\mathcal{N}_D} \rho\, U_{\mathcal{N}_D}^\dagger = \sum_{x,x'} \langle x|\rho|x'\rangle\, |x\rangle\langle x'|^B \otimes |\varphi_x\rangle\langle\varphi_{x'}|^E.$ (5.57)

Tracing out the environment gives the action of the channel $\mathcal{N}_D$ to the receiver

$\mathcal{N}_D(\rho) = \sum_{x,x'} \langle x|\rho|x'\rangle \langle\varphi_{x'}|\varphi_x\rangle\, |x\rangle\langle x'|^B,$ (5.58)
where we observe that this channel preserves the diagonal components $\{|x\rangle\langle x|\}$ of $\rho$, but
it multiplies the $d(d-1)$ off-diagonal elements of $\rho$ by arbitrary phases, depending on the
$d(d-1)$ overlaps $\langle\varphi_{x'}|\varphi_x\rangle$ of the environment states (where $x \neq x'$). Tracing out the receiver
gives the action of the complementary channel $\mathcal{N}_D^c$ to the environment

$\mathcal{N}_D^c(\rho) = \sum_x \langle x|\rho|x\rangle\, |\varphi_x\rangle\langle\varphi_x|^E.$ (5.59)

Observe that the channel to the environment is entanglement-breaking. That is, the action
of the channel is the same as first performing a von Neumann measurement in the basis
$\{|x\rangle\}$ and preparing a state $|\varphi_x\rangle$ conditional on the outcome of the measurement (it is a
classical-quantum channel from Section 4.4.6). Additionally, the receiver Bob can simulate
the action of this channel to the environment by performing the same actions on the state that
he receives.
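The two outputs of a generalized dephasing channel can be sketched numerically as follows. This is an illustrative example, assuming the input density operator is given as a matrix in the preferred basis $\{|x\rangle\}$ and the environment states $|\varphi_x\rangle$ are given as normalized amplitude lists; the function name and the sample states are hypothetical.

```python
import math

def dephasing_outputs(rho, phis):
    """Return (Bob's state, Eve's state) for a generalized dephasing channel.

    rho  : d x d density matrix in the preferred basis {|x>}
    phis : list of d normalized environment state vectors |phi_x>
    """
    d, n = len(rho), len(phis[0])

    def inner(u, v):
        # Inner product <u|v>.
        return sum(complex(a).conjugate() * b for a, b in zip(u, v))

    # Bob: entry (x, x') of rho is multiplied by the overlap <phi_x'|phi_x>.
    bob = [[rho[x][y] * inner(phis[y], phis[x]) for y in range(d)]
           for x in range(d)]
    # Eve: mixture of |phi_x><phi_x| weighted by the diagonal entries of rho.
    eve = [[sum(rho[x][x] * phis[x][i] * complex(phis[x][j]).conjugate()
                for x in range(d))
            for j in range(n)] for i in range(n)]
    return bob, eve

# Sample input: environment states with overlap <phi_1|phi_0> = 1/2.
rho = [[0.5, 0.25], [0.25, 0.5]]
phis = [[1.0, 0.0], [0.5, math.sqrt(0.75)]]
bob, eve = dephasing_outputs(rho, phis)
```

In this sample the diagonal of Bob's state is unchanged while the off-diagonal entries are halved, and Eve's state depends only on the diagonal of the input, which is the measure-and-prepare behavior described above.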

Exercise 5.2.14 Explicitly show that the following qubit dephasing channel is a special case 
of a generalized dephasing channel: 

$\rho \to (1 - p)\rho + p Z \rho Z.$ (5.60)

5.2.4 Quantum Hadamard Channels 

Quantum Hadamard channels are those whose complements are entanglement-breaking. We
can write the output of such a channel as the Hadamard product (element-wise multiplication)
of a representation of the input density operator with another operator. To discuss how this
comes about, suppose that the complementary channel $(\mathcal{N}^c)^{A' \to E}$ of a channel $\mathcal{N}^{A' \to B}$ is
entanglement-breaking. Then, using the fact that its Kraus operators $|\xi_i\rangle^E\langle\zeta_i|^{A'}$ are unit rank
(see Theorem 4.4.1) and the construction in (5.25) for an isometric extension, we can write an
isometric extension $U_{\mathcal{N}^c}$ for $\mathcal{N}^c$ as

$U_{\mathcal{N}^c} \rho\, U_{\mathcal{N}^c}^\dagger = \sum_{i,j} |\xi_i\rangle^E\langle\zeta_i|^{A'} \rho\, |\zeta_j\rangle^{A'}\langle\xi_j|^E \otimes |i\rangle^B\langle j|^B$ (5.61)
$= \sum_{i,j} \langle\zeta_i|^{A'} \rho\, |\zeta_j\rangle^{A'}\, |\xi_i\rangle\langle\xi_j|^E \otimes |i\rangle\langle j|^B.$ (5.62)

The sets $\{|\xi_i\rangle^E\}$ and $\{|\zeta_i\rangle^{A'}\}$ each do not necessarily consist of orthonormal states, but the
set $\{|i\rangle^B\}$ does because it is the environment of the complementary channel. Tracing over
the system E gives the original channel from system A' to B:

$\mathcal{N}_H^{A' \to B}(\rho) = \sum_{i,j} \langle\zeta_i|^{A'} \rho\, |\zeta_j\rangle^{A'}\, \langle\xi_j|\xi_i\rangle^E\, |i\rangle\langle j|^B.$ (5.63)

Let $\Sigma$ denote the matrix with elements $[\Sigma]_{i,j} = \langle\zeta_i|^{A'} \rho\, |\zeta_j\rangle^{A'}$, a representation of the input
state $\rho$, and let $\Gamma$ denote the matrix with elements $[\Gamma]_{i,j} = \langle\xi_i|\xi_j\rangle^E$. Then, from (5.63), it is
clear that the output of the channel is the Hadamard product $*$ of $\Sigma$ and $\Gamma^\dagger$ with respect to
the basis $\{|i\rangle^B\}$:

$\mathcal{N}_H^{A' \to B}(\rho) = \Sigma * \Gamma^\dagger.$ (5.64)

For this reason, such a channel is known as a Hadamard channel. 

Hadamard channels are degradable, meaning that there exists a degrading map $\mathcal{D}^{B \to E}$
such that Bob can simulate the channel to Eve:

$\forall \rho \quad \mathcal{D}^{B \to E}(\mathcal{N}_H^{A' \to B}(\rho)) = (\mathcal{N}_H^c)^{A' \to E}(\rho).$ (5.65)

If Bob performs a von Neumann measurement of his state in the basis $\{|i\rangle^B\}$ and prepares
the state $|\xi_i\rangle^E$ conditional on the outcome of the measurement, this procedure simulates
the complementary channel $(\mathcal{N}_H^c)^{A' \to E}$ and also implies that the degrading map $\mathcal{D}^{B \to E}$ is
entanglement-breaking. To be more precise, the Kraus operators of the degrading map $\mathcal{D}^{B \to E}$
are $\{|\xi_i\rangle^E\langle i|^B\}$ so that

$\mathcal{D}^{B \to E}(\mathcal{N}_H^{A' \to B}(\sigma)) = \sum_i |\xi_i\rangle^E\langle i|^B\, \mathcal{N}_H^{A' \to B}(\sigma)\, |i\rangle^B\langle\xi_i|^E$ (5.66)
$= \sum_i \langle\zeta_i|^{A'} \sigma\, |\zeta_i\rangle^{A'}\, |\xi_i\rangle\langle\xi_i|^E,$ (5.67)

demonstrating that this degrading map effectively simulates the complementary channel $(\mathcal{N}_H^c)^{A' \to E}$.
Note that we can view this degrading map as the composition of two maps: a first map $\mathcal{D}_1^{B \to Y}$
performs the von Neumann measurement, leading to a classical variable Y, and a second
map $\mathcal{D}_2^{Y \to E}$ performs the state preparation, conditional on the value of the classical variable
Y. We can therefore write $\mathcal{D}^{B \to E} = \mathcal{D}_2^{Y \to E} \circ \mathcal{D}_1^{B \to Y}$. This particular form of the channel has
implications for its quantum capacity (see Chapter 23) and its more general capacities (see
Chapter 24). Observe that a generalized dephasing channel from the previous section is a
quantum Hadamard channel because the map to its environment is entanglement-breaking.

5.3 Coherent Quantum Instrument 

It is useful to consider the isometric extension of a quantum instrument (we discussed quantum
instruments in Section 4.4.7). This viewpoint is important when we recall that a quantum
instrument is the most general map from a quantum system to a quantum system and
a classical system.

Recall from Section 4.4.7 that a quantum instrument acts as follows:

$\rho \to \sum_j \mathcal{E}_j^{A \to B}(\rho) \otimes |j\rangle\langle j|^J,$ (5.68)

where each $\mathcal{E}_j$ is a completely-positive trace-reducing (CPTR) map that has the following
form:

$\mathcal{E}_j^{A \to B}(\rho) = \sum_k M_{j,k}\, \rho\, M_{j,k}^\dagger,$ (5.69)

and the operators $M_{j,k}$ satisfy the relation $\sum_k M_{j,k}^\dagger M_{j,k} \leq I$.

We now describe a particular coherent evolution that implements the above transformation
when we trace over certain degrees of freedom. An isometric extension of each CPTR
map $\mathcal{E}_j$ is as follows:

$U_{\mathcal{E}_j}^{A \to BE} \equiv \sum_k M_{j,k} \otimes |k\rangle^E,$ (5.70)

where the operator $M_{j,k}$ acts on the input state and the environment system E is large
enough to accommodate all of the CPTR maps $\mathcal{E}_j$. That is, if the first map $\mathcal{E}_1$ has states
$\{|1\rangle^E, \ldots, |d_1\rangle^E\}$, then the second map $\mathcal{E}_2$ has states $\{|d_1 + 1\rangle^E, \ldots, |d_1 + d_2\rangle^E\}$ so that
the states on E are orthogonal for all the different maps $\mathcal{E}_j$ that are part of the instrument.
We can embed this isometric extension into the evolution in (5.68) as follows:

$\rho \to \sum_j U_{\mathcal{E}_j}^{A \to BE}(\rho) \otimes |j\rangle\langle j|^J.$ (5.71)

This evolution is not quite fully coherent, but a simple modification of it does make it fully
coherent:

$\sum_j U_{\mathcal{E}_j}^{A \to BE} \otimes |j\rangle^J \otimes |j\rangle^{E_J}.$ (5.72)

The full action of the coherent instrument is then as follows:

$\rho \to \sum_{j,j'} U_{\mathcal{E}_j}^{A \to BE}\, \rho\, (U_{\mathcal{E}_{j'}}^{A \to BE})^\dagger \otimes |j\rangle\langle j'|^J \otimes |j\rangle\langle j'|^{E_J}$ (5.73)
$= \sum_{j,k,j',k'} M_{j,k}\, \rho\, M_{j',k'}^\dagger \otimes |k\rangle\langle k'|^E \otimes |j\rangle\langle j'|^J \otimes |j\rangle\langle j'|^{E_J}.$ (5.74)

One can then check that tracing over the environmental degrees of freedom E and $E_J$
reproduces the action of the quantum instrument in (5.68).

5.4 Coherent Measurement 



We end this chapter by discussing a coherent measurement. This last section shows that it 
is sufficient to describe all of the quantum theory in the so-called "traditionalist" way by 
using only unitary evolutions and von Neumann projective measurements. 

Suppose that we have a set of measurement operators $\{M_j\}_j$ such that $\sum_j M_j^\dagger M_j = I$.
In the noisy quantum theory, we found that the post-measurement state of a measurement
on a quantum system S with density operator $\rho$ is

$\frac{M_j \rho M_j^\dagger}{p_J(j)},$ (5.75)

where the measurement outcome j occurs with probability

$p_J(j) = \mathrm{Tr}\{M_j^\dagger M_j \rho\}.$ (5.76)

We would like a way to perform the above measurement on system S in a coherent fashion.
The isometry in (5.25) gives a hint for how we can structure such a coherent measurement.
We can build the coherent measurement as the following isometry:

$U^{S \to SS'} \equiv \sum_j M_j^S \otimes |j\rangle^{S'}.$ (5.77)

Applying this isometry to a density operator $\rho$ gives the following state:

$U^{S \to SS'}(\rho) = \sum_{j,j'} M_j^S\, \rho\, (M_{j'}^S)^\dagger \otimes |j\rangle\langle j'|^{S'}.$ (5.78)

We can then apply a von Neumann measurement with projection operators $\{|j\rangle\langle j|\}_j$ to the
system S', which then gives the following post-measurement state:

$\frac{(I^S \otimes |j\rangle\langle j|^{S'})(U^{S \to SS'}(\rho))(I^S \otimes |j\rangle\langle j|^{S'})}{\mathrm{Tr}\{(I^S \otimes |j\rangle\langle j|^{S'})(U^{S \to SS'}(\rho))\}} = \frac{M_j^S\, \rho\, (M_j^S)^\dagger}{\mathrm{Tr}\{(M_j^S)^\dagger M_j^S \rho\}} \otimes |j\rangle\langle j|^{S'}.$ (5.79)

The result is then the same as that in (5.75).
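A coherent measurement can be sketched on a pure input state as follows. This is a minimal illustration, assuming the isometry of (5.77) acts on a state vector and that the pointer system S' is afterwards measured projectively; the function names and the sample measurement operators are hypothetical choices, not from the text.

```python
import math

def coherent_measure(ops, psi):
    """Amplitudes of U|psi> = sum_j (M_j|psi>) (x) |j>, grouped by pointer value j."""
    return [[sum(m[r][c] * psi[c] for c in range(len(psi)))
             for r in range(len(m))]
            for m in ops]

def project_pointer(branches, j):
    """Project the pointer S' onto |j>: return (probability, normalized state of S)."""
    prob = sum(abs(a) ** 2 for a in branches[j])
    state = [a / math.sqrt(prob) for a in branches[j]] if prob > 0 else None
    return prob, state

# Sample measurement operators satisfying M_0^dagger M_0 + M_1^dagger M_1 = I.
ops = [[[1.0, 0.0], [0.0, math.sqrt(0.5)]],
       [[0.0, 0.0], [0.0, math.sqrt(0.5)]]]
psi = [0.6, 0.8]
branches = coherent_measure(ops, psi)
p0, post0 = project_pointer(branches, 0)
p1, post1 = project_pointer(branches, 1)
```

The probabilities `p0` and `p1` equal $\langle\psi|M_j^\dagger M_j|\psi\rangle$ and sum to one, and each returned state is $M_j|\psi\rangle$ after normalization, in line with (5.75) and (5.76) for a pure input.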



Exercise 5.4.1 Suppose that there is a set of density operators $\rho_k^S$ and a POVM $\{\Lambda_k^S\}$ that
identifies these states with high probability, in the sense that

$\forall k \quad \mathrm{Tr}\{\Lambda_k^S \rho_k^S\} \geq 1 - \epsilon,$ (5.80)

where $\epsilon$ is some small number such that $\epsilon > 0$. Construct a coherent measurement $U^{S \to SS'}$
and show that the coherent measurement has a high probability of success in the sense that

$\langle\phi_k|^{RS} \langle k|^{S'}\, U^{S \to SS'}\, |\phi_k\rangle^{RS} \geq 1 - \epsilon,$ (5.81)

where each $|\phi_k\rangle^{RS}$ is a purification of $\rho_k$.



5.5 History and Further Reading 



The purified view of quantum mechanics has long been part of quantum information theory
(e.g., see the book of Nielsen and Chuang [197] or Yard's thesis [266]). The notion of
an isometric extension of a quantum channel is due to early work of Stinespring [236].
Giovannetti and Fazio discussed some of the observations about the amplitude damping channel
that appear in our exercises [102]. Devetak and Shor introduced generalized dephasing channels in
the context of trade-off coding and they also introduced the notion of a degradable quantum
channel [73]. King et al. studied the quantum Hadamard channels in Ref. [172]. Coherent
instruments and measurements appeared in Refs. [68, 75, 156] as part of the decoder used
in several quantum coding theorems. We exploit them in Chapters 23 and 24.
CHAPTER 6



Three Unit Quantum Protocols 



This chapter begins our first exciting application of the postulates of the quantum theory 
to quantum communication. We study the fundamental, unit quantum communication pro- 
tocols. These protocols involve a single sender, whom we name Alice, and a single receiver, 
whom we name Bob. The protocols are ideal and noiseless because we assume that Alice and 
Bob can exploit perfect classical communication, perfect quantum communication, and per- 
fect entanglement. At the end of this chapter, we suggest how to incorporate imperfections 
into these protocols for later study. 

Alice and Bob may wish to perform one of several quantum information processing tasks, 
such as the transmission of classical information, quantum information, or entanglement. 
Several fundamental protocols make use of these resources: 

1. We will see that noiseless entanglement is an important resource in quantum Shannon 
theory because it enables Alice and Bob to perform other protocols that are not possible 
with classical resources only. We will present a simple, idealized protocol for generating 
entanglement, named entanglement distribution. 

2. Alice may wish to communicate classical information to Bob. A trivial method, named 
elementary coding, is a simple way for doing so and we discuss it briefly. 

3. A more elegant technique for transmitting classical information is super-dense coding. 
It exploits a noiseless qubit channel and shared entanglement to transmit more classical 
information than would be possible with a noiseless qubit channel only. 

4. Finally, Alice may wish to transmit quantum information to Bob. A trivial method for 
Alice to transmit quantum information is for her to exploit a noiseless qubit channel. 
Though, it is useful to have other ways for transmitting quantum information because 
such a resource is difficult to engineer in practice. An alternative, surprising method 
for transmitting quantum information is quantum teleportation. The teleportation pro- 
tocol exploits classical communication and shared entanglement to transmit quantum 
information. 
Each of these protocols is a fundamental unit protocol and provides a foundation for
asking further questions in quantum Shannon theory. In fact, the discovery of these latter 
two protocols was the stimulus for much of the original research in quantum Shannon theory. 

We introduce the technique of resource counting in this chapter. This technique is of 
practical importance because it quantifies the communication cost of achieving a certain 
task. We include only nonlocal resources in a resource count — nonlocal resources include 
classical or quantum communication or shared entanglement. 

It is important to minimize the use of certain resources, such as noiseless entanglement 
or a noiseless qubit channel, in a given protocol because they are expensive. Given a certain 
implementation of a quantum information processing task, we may wonder if there is a way 
of implementing it that consumes fewer resources. A proof that a given protocol is the best 
that we can hope to do is an optimality proof (also known as a converse proof, as discussed in
Section 2.1.3). We argue, based on good physical grounds, that the protocols in this chapter
are the best implementations of the desired quantum information processing task.

6.1 Nonlocal Unit Resources 

We first briefly define what we mean by a noiseless qubit channel, a noiseless classical bit 
channel, and noiseless entanglement. Each of these resources is a nonlocal, unit resource. A 
resource is nonlocal if two spatially separated parties share it or if one generates it to the 
other. We say that a resource is unit if it comes in some "gold standard" form, such as 
qubits, classical bits, or entangled bits. It is important to establish these definitions so that 
we can check whether a given protocol is truly simulating one of these resources. 
A noiseless qubit channel is any mechanism that implements the following map: 

$|i\rangle^A \to |i\rangle^B,$ (6.1)

where $i \in \{0, 1\}$, $\{|0\rangle^A, |1\rangle^A\}$ is some preferred orthonormal basis on Alice's system, and
$\{|0\rangle^B, |1\rangle^B\}$ is some preferred orthonormal basis on Bob's system. The bases do not have to
be the same, but it must be clear which basis each party is using. The above map is linear
so that it preserves arbitrary superposition states (it preserves any qubit). For example, the
map acts as follows on a superposition state:

$\alpha|0\rangle^A + \beta|1\rangle^A \to \alpha|0\rangle^B + \beta|1\rangle^B.$ (6.2)

We can also write it as the following isometry:

$\sum_{i=0}^{1} |i\rangle^B \langle i|^A.$ (6.3)

Any information processing protocol that implements the above map simulates a noiseless
qubit channel. We label the communication resource of a noiseless qubit channel as follows:

$[q \to q],$ (6.4)

where the notation indicates one forward use of a noiseless qubit channel.

A noiseless classical bit channel is any mechanism that implements the following map: 

$|i\rangle\langle i|^A \to |i\rangle\langle i|^B,$ (6.5)

$|i\rangle\langle j|^A \to 0 \text{ for } i \neq j,$ (6.6)

where $i, j \in \{0, 1\}$ and the orthonormal bases are again arbitrary. This channel maintains
the diagonal elements of a density operator in the basis $\{|0\rangle^A, |1\rangle^A\}$, but it eliminates the
off-diagonal elements. We can write it as the following map:

$\rho \to \sum_{i=0}^{1} |i\rangle^B \langle i|^A\, \rho\, |i\rangle^A \langle i|^B.$ (6.7)

This resource is weaker than a noiseless qubit channel because it does not require Alice and
Bob to maintain arbitrary superposition states; it merely transfers classical information.
Alice can of course use the above channel to transmit classical information to Bob. She can
prepare either of the classical states $|0\rangle\langle 0|$ or $|1\rangle\langle 1|$, send it through the classical channel, and
Bob performs a computational basis measurement to determine the message Alice transmits.
We denote the communication resource of a noiseless classical bit channel as follows:

$[c \to c],$ (6.8)

where the notation indicates one forward use of a noiseless classical bit channel.

We can study other ways of transmitting classical information. For example, suppose 
that Alice flips a fair coin that chooses the state |0) or |1) with equal probability. The 
resulting state is the following density operator: 

$\frac{1}{2}(|0\rangle\langle 0|^A + |1\rangle\langle 1|^A).$ (6.9)

Suppose that she sends the above state through a noiseless classical channel. The resulting 
density operator for Bob is as follows: 

^(|0}(0| B + |1)(1| B ). (6.10) 

The above classical bit channel map does not necessarily preserve off-diagonal elements 
of a density operator. Suppose instead that Alice prepares a superposition state 

^. 

The density operator corresponding to this state is 

i(|0}(0| A + |0}(l| A + |l}(0| A + |l)(l| A ). (6.12) 

©2012 Mark M. Wilde — This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 
Unported License 



Suppose Alice then transmits this state through the above classical channel. The classical 
channel eliminates all the off-diagonal elements of the density operator and the resulting 
state for Bob is as follows: 

$$\frac{1}{2}\left(|0\rangle\langle 0|^B + |1\rangle\langle 1|^B\right). \tag{6.13}$$

Thus, it is impossible for a noiseless classical channel to simulate a noiseless qubit channel 
because it cannot maintain arbitrary superposition states. Though, it is possible for a 
noiseless qubit channel to simulate a noiseless classical bit channel and we denote this fact 
with the following resource inequality: 

$$[q \to q] \geq [c \to c]. \tag{6.14}$$

Noiseless quantum communication is therefore a stronger resource than noiseless classical 
communication. 
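The effect of the classical bit channel map (6.7) on the two example states can be checked numerically. The following is a minimal sketch in plain Python, assuming density operators are stored as nested lists in the computational basis; the helper names `classical_bit_channel` and `density_operator` are hypothetical and chosen only for this illustration.

```python
# Sketch of the noiseless classical bit channel (6.7) on a single qubit.
# Assumption: rho is a nested list [[rho00, rho01], [rho10, rho11]] written
# in the computational basis. The helper names are illustrative only.

def classical_bit_channel(rho):
    """Keep the diagonal entries of rho and set the off-diagonal ones to zero."""
    return [[rho[i][j] if i == j else 0.0 for j in range(2)] for i in range(2)]


def density_operator(psi):
    """Return |psi><psi| for a list psi of two complex amplitudes."""
    return [[a * b.conjugate() for b in psi] for a in psi]


# Superposition state (6.11): (|0> + |1>)/sqrt(2).
plus = [complex(2 ** -0.5), complex(2 ** -0.5)]
rho_plus = density_operator(plus)            # all four entries equal 1/2, as in (6.12)
rho_bob = classical_bit_channel(rho_plus)    # diagonal with entries 1/2, as in (6.13)

# Fair-coin state (6.9): already diagonal, so the channel leaves it unchanged.
rho_coin = classical_bit_channel([[0.5, 0.0], [0.0, 0.5]])
```

Both inputs lead to the same output for Bob, which is one way to see that the channel cannot distinguish a superposition from a classical mixture.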



Exercise 6.1.1 Show that the noisy dephasing channel in (4.206) with p = 1/2 is equivalent 
to a noiseless classical bit channel. 

The final resource that we consider is shared entanglement. The ebit is our "gold standard"
resource for pure bipartite (two-party) entanglement, and we will make this point
more clear operationally in Chapter 18. An ebit is the following state of two qubits:

$$|\Phi^+\rangle^{AB} \equiv \frac{|00\rangle^{AB} + |11\rangle^{AB}}{\sqrt{2}}, \tag{6.15}$$

where Alice possesses the first qubit and Bob possesses the second. 

Below, we show how a noiseless qubit channel can generate a noiseless ebit through a sim- 
ple protocol named entanglement distribution. Though, an ebit cannot simulate a noiseless 
qubit channel (for reasons which we explain later). Therefore, noiseless quantum communi- 
cation is the strongest of all three resources, and entanglement and classical communication 
are in some sense "orthogonal" to one another because neither can simulate the other. 

6.2 Protocols 

6.2.1 Entanglement Distribution 

The entanglement distribution protocol is the most basic of the three unit protocols. It 
exploits one use of a noiseless qubit channel to establish one shared noiseless ebit. It consists 
of the following two steps: 



1. Alice prepares a Bell state locally in her laboratory. She prepares two qubits in the
state $|0\rangle^A |0\rangle^{A'}$, where we label the first qubit as A and the second qubit as A'. She
performs a Hadamard gate on qubit A to produce the following state:

$$\left(\frac{|0\rangle^A + |1\rangle^A}{\sqrt{2}}\right) |0\rangle^{A'}. \tag{6.16}$$

She then performs a CNOT gate with qubit A as the source qubit and qubit A' as the
target qubit. The state becomes the following Bell state:

$$|\Phi^+\rangle^{AA'} = \frac{|00\rangle^{AA'} + |11\rangle^{AA'}}{\sqrt{2}}. \tag{6.17}$$

2. She sends qubit A' to Bob with one use of a noiseless qubit channel. Alice and Bob
then share the ebit $|\Phi^+\rangle^{AB}$.

Figure 6.1 depicts the entanglement distribution protocol.

Figure 6.1: The above figure depicts a protocol for entanglement distribution. Alice performs local operations
(the Hadamard and CNOT) and consumes one use of a noiseless qubit channel to generate one noiseless
ebit $|\Phi^+\rangle^{AB}$ shared with Bob.
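The local operations in the first step of entanglement distribution can be sketched numerically. The code below is an illustrative sketch in plain Python, assuming a two-qubit state is stored as a list of four amplitudes ordered $|00\rangle, |01\rangle, |10\rangle, |11\rangle$ with qubit A first; the helpers `apply_1q` and `apply_cnot` are hypothetical names introduced here, not part of the text.

```python
import math

# Assumption: an n-qubit state is a list of 2**n amplitudes, with qubit 0
# (here: A) as the most significant bit of the index.

def apply_1q(gate, state, qubit, n):
    """Apply a 2x2 gate to the given qubit of an n-qubit state vector."""
    shift = n - 1 - qubit
    out = [0j] * len(state)
    for idx, amp in enumerate(state):
        bit = (idx >> shift) & 1
        for new_bit in (0, 1):
            # Index obtained by replacing `bit` with `new_bit` on this qubit.
            new_idx = idx ^ ((bit ^ new_bit) << shift)
            out[new_idx] += gate[new_bit][bit] * amp
    return out


def apply_cnot(state, control, target, n):
    """Flip the target qubit on every basis state whose control qubit is 1."""
    out = [0j] * len(state)
    for idx, amp in enumerate(state):
        if (idx >> (n - 1 - control)) & 1:
            idx ^= 1 << (n - 1 - target)
        out[idx] += amp
    return out


s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]                # Hadamard gate

state = [1, 0, 0, 0]                 # |0>^A |0>^A'
state = apply_1q(H, state, 0, 2)     # state (6.16)
state = apply_cnot(state, 0, 1, 2)   # Bell state (6.17)
```

Sending qubit A' through the noiseless qubit channel only relabels the second system as B, so the amplitude list is unchanged by the second step.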



The following resource inequality quantifies the nonlocal resources consumed or generated 
in the above protocol: 

$$[q \to q] \geq [qq], \tag{6.18}$$

where $[q \to q]$ denotes one forward use of a noiseless qubit channel and $[qq]$ denotes a shared,
noiseless ebit. The meaning of the resource inequality is that there exists a protocol that
consumes the resource on the left in order to generate the resource on the right. The best
analogy is to think of a resource inequality as a "chemical reaction"-like formula, where the
protocol is like a chemical reaction that transforms one resource into another.

There are several subtleties to notice about the above protocol and its corresponding 
resource inequality: 

1. We are careful with the language when describing the resource state. We described
the state $|\Phi^+\rangle$ as a Bell state in the first step because it is a local state in Alice's
laboratory. We only used the term "ebit" to describe the state after the second step,
when the state becomes a nonlocal resource shared between Alice and Bob.

2. The resource count involves nonlocal resources only: we do not factor any local operations,
such as the Hadamard gate or the CNOT gate, into the resource count. This
line of thinking is different from the theory of computation, where it is of utmost importance
to minimize the number of steps involved in a computation. In this book,
we are developing a theory of quantum communication and we thus count nonlocal
resources only.

3. We are assuming that it is possible to perform all local operations perfectly. This line 
of thinking is another departure from practical concerns that one might have in fault 
tolerant quantum computation, the study of the propagation of errors in quantum 
operations. Performing a CNOT gate is a highly nontrivial task at the current stage of 
experimental development in quantum computation, with most implementations being 
far from perfect. Nevertheless, we proceed forward with this communication-theoretic 
line of thinking. 

The following exercises outline classical information processing tasks that are analogous 
to the task of entanglement distribution. 

Exercise 6.2.1 Outline a protocol for common randomness distribution. Suppose that Alice 
and Bob have available one use of a noiseless classical bit channel. Give a method for them 
to implement the following resource inequality: 

$$[c \to c] \geq [cc], \tag{6.19}$$

where $[c \to c]$ denotes one forward use of a noiseless classical bit channel and $[cc]$ denotes a
shared, nonlocal bit of common randomness.

Exercise 6.2.2 Consider three parties Alice, Bob, and Eve and suppose that a noiseless 
private channel connects Alice to Bob. Privacy here implies that Eve does not learn anything 
about the information that traverses the private channel — Eve's probability distribution is 
independent of Alice and Bob's: 

$$p_{A,B,E}(a, b, e) = p_A(a)\, p_{B|A}(b|a)\, p_E(e). \tag{6.20}$$

For a noiseless private bit channel, $p_{B|A}(b|a) = \delta_{b,a}$. A noiseless secret key corresponds to
the following distribution:

$$p_{A,B,E}(a, b, e) = \frac{1}{2}\, \delta_{b,a}\, p_E(e), \tag{6.21}$$

where $\frac{1}{2}$ implies that the key is equal to '0' or '1' with equal probability, $\delta_{b,a}$ implies a perfectly
correlated secret key, and the factoring of the distribution $p_{A,B,E}(a, b, e)$ implies the secrecy
of the key (Eve's information is independent of Alice and Bob's). The difference between
a noiseless private bit channel and a noiseless secret key is that the private channel is a 
dynamic resource while the secret key is a shared, static resource. Show that it is possible 
to upgrade the protocol for common randomness distribution to a protocol for secret key 
distribution, if Alice and Bob share a noiseless private bit channel. That is, show that they
can achieve the following resource inequality:

$$[c \to c]_{\text{priv}} \geq [cc]_{\text{priv}}, \tag{6.22}$$

where $[c \to c]_{\text{priv}}$ denotes one forward use of a noiseless private bit channel and $[cc]_{\text{priv}}$ denotes
one bit of shared, noiseless secret key.

Entanglement and Quantum Communication 

Can entanglement enable two parties to communicate quantum information? It is natural to 
wonder if there is a protocol corresponding to the following resource inequality: 

$$[qq] \geq [q \to q]. \tag{6.23}$$

Such a resource inequality would be of great utility. 

Unfortunately, it is physically impossible to construct a protocol that implements the 
above resource inequality. The argument against such a protocol arises from the theory of 
relativity. Specifically, the theory of relativity prohibits information transfer or signaling at a 
speed greater than the speed of light. Suppose that two parties share noiseless entanglement 
over a large distance. That resource is a static resource, possessing only shared quantum 
correlations. If a protocol were to exist that implements the above resource inequality, it 
would imply that two parties could communicate quantum information faster than the speed 
of light, because they would be exploiting the entanglement for the instantaneous transfer 
of quantum information. 



The entanglement distribution resource inequality is only "one-way," as in (6.18). Quan- 
tum communication is therefore strictly stronger than shared entanglement when no other 
nonlocal resources are available. 

6.2.2 Elementary Coding 

We can also send classical information with a noiseless qubit channel. A simple protocol for 
doing so is elementary coding. This protocol consists of the following steps: 

1. Alice prepares either $|0\rangle$ or $|1\rangle$, depending on the classical bit that she would like to
send. 

2. She transmits this state over the noiseless qubit channel and Bob receives the qubit. 

3. Bob performs a measurement in the computational basis to determine the classical bit 
that Alice transmitted. 

Elementary coding succeeds without error because Bob's measurement can always distinguish
the classical states $|0\rangle$ and $|1\rangle$. The following resource inequality applies to elementary
coding:

$$[q \to q] \geq [c \to c]. \tag{6.24}$$

Again, we are only counting nonlocal resources in the resource count: we do not count the
state preparation at the beginning or the measurement at the end.

If no other resources are available for consumption, the above resource inequality is 
optimal — one cannot do better than to transmit one classical bit of information per use of 
a noiseless qubit channel. This result may be a bit frustrating at first, because it may seem 
that we could exploit the continuous degrees of freedom in the probability amplitudes of 
a qubit state for encoding more than one classical bit per qubit. Unfortunately, there is 
no way that we can access the information in the continuous degrees of freedom with any
measurement scheme. The result of Exercise 4.2.2 demonstrates the optimality of the above
protocol, and it holds as well by use of the Holevo bound in Chapter 11.



6.2.3 Quantum Super-Dense Coding 

We now outline a protocol named super-dense coding. It is named such because it has the
striking property that noiseless entanglement can double the classical communication ability
of a noiseless qubit channel. It consists of three steps:

1. Suppose that Alice and Bob share an ebit $|\Phi^+\rangle^{AB}$. Alice applies one of four unitary
operations $\{I, X, Z, XZ\}$ to her side of the above state. The state becomes one of the
following four Bell states (up to a global phase), depending on the message that Alice
chooses:

$$|\Phi^+\rangle^{AB}, \quad |\Phi^-\rangle^{AB}, \quad |\Psi^+\rangle^{AB}, \quad |\Psi^-\rangle^{AB}. \tag{6.25}$$

The definitions of these Bell states are in (3.194)-(3.195).

2. She transmits her qubit to Bob with one use of a noiseless qubit channel.

3. Bob performs a Bell measurement (a measurement in the basis $\{|\Phi^+\rangle^{AB}, |\Phi^-\rangle^{AB},
|\Psi^+\rangle^{AB}, |\Psi^-\rangle^{AB}\}$) to distinguish perfectly the four states. He can distinguish the states
because they are all orthogonal to each other.

Thus, Alice can transmit two classical bits (corresponding to the four messages) if she
shares a noiseless ebit with Bob and uses a noiseless qubit channel. Figure 6.2 depicts the
protocol for quantum super-dense coding.
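The three steps of super-dense coding can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: two-qubit states are lists of four amplitudes ordered $|00\rangle, |01\rangle, |10\rangle, |11\rangle$ with Alice's qubit first, and the assignment of two-bit messages to Pauli operators in `ENCODING` is a hypothetical choice (the text does not fix one).

```python
import math

s = 1 / math.sqrt(2)

# The four Bell states as real amplitude lists (Alice's qubit first).
BELL = {
    "Phi+": [s, 0, 0, s],
    "Phi-": [s, 0, 0, -s],
    "Psi+": [0, s, s, 0],
    "Psi-": [0, s, -s, 0],
}

I = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]
XZ = [[0, -1], [1, 0]]   # matrix product of X and Z

# Hypothetical message-to-operator assignment, chosen for this sketch.
ENCODING = {"00": I, "01": X, "10": Z, "11": XZ}


def apply_to_alice(gate, state):
    """Apply a 2x2 gate to the first qubit of a two-qubit state vector."""
    out = [0] * 4
    for a_new in (0, 1):
        for a_old in (0, 1):
            for b in (0, 1):
                out[2 * a_new + b] += gate[a_new][a_old] * state[2 * a_old + b]
    return out


def bell_measure(state):
    """Return the label of the Bell state that `state` equals up to a phase."""
    for label, vec in BELL.items():
        overlap = sum(v * a for v, a in zip(vec, state))  # Bell vectors are real
        if abs(abs(overlap) - 1) < 1e-9:
            return label
    raise ValueError("state is not a Bell state")


# Step 1: Alice encodes. Step 2: she sends her qubit. Step 3: Bob measures.
decoded = {
    bits: bell_measure(apply_to_alice(gate, BELL["Phi+"]))
    for bits, gate in ENCODING.items()
}
```

Each of the four messages lands on a different Bell state, so Bob's measurement identifies the two bits without error. The operator XZ produces $|\Psi^-\rangle$ only up to a global sign, which is why the measurement helper compares the magnitude of the overlap.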

The super-dense coding protocol implements the following resource inequality: 

$$[qq] + [q \to q] \geq 2[c \to c]. \tag{6.26}$$

Notice again that the resource inequality counts the use of nonlocal resources only: we do
not count the local operations at the beginning of the protocol or the Bell measurement at
the end of the protocol.

Also, notice that we could have implemented two noiseless classical bit channels with two 
instances of elementary coding: 

$$2[q \to q] \geq 2[c \to c]. \tag{6.27}$$

Though, this method is not as powerful as the super-dense coding protocol: in super-dense
coding, we consume the weaker resource of an ebit to help transmit two classical bits, instead
of consuming the stronger resource of an extra noiseless qubit channel.

Figure 6.2: The above figure depicts the dense coding protocol. Alice and Bob share an ebit before the
protocol begins. Alice would like to transmit two classical bits $x_1 x_2$ to Bob. She performs a Pauli rotation
conditional on her two classical bits and sends her half of the ebit over a noiseless qubit channel. Bob can
then recover the two classical bits by performing a Bell measurement.

The dense coding protocol also transmits the classical bits privately. Suppose a third 
party intercepts the qubit that Alice transmits. There is no measurement that the third 
party can perform to determine which message Alice transmits because the local density
operator of all of the Bell states is the same and equal to the maximally mixed state $\pi^A$ (the
information for the eavesdropper is constant for each message that Alice transmits). The
privacy of the protocol is due to Alice and Bob sharing maximal entanglement. We exploit
this aspect of the dense coding protocol when we make it coherent in Chapter 7.
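The privacy claim can be checked directly by computing the local density operator of the transmitted qubit for each Bell state. The following is a minimal sketch in plain Python, assuming the same amplitude ordering $|00\rangle, |01\rangle, |10\rangle, |11\rangle$ with Alice's qubit first; the helper name `reduced_state_first_qubit` is hypothetical.

```python
import math

s = 1 / math.sqrt(2)

BELL = {
    "Phi+": [s, 0, 0, s],
    "Phi-": [s, 0, 0, -s],
    "Psi+": [0, s, s, 0],
    "Psi-": [0, s, -s, 0],
}


def reduced_state_first_qubit(state):
    """Trace out the second qubit of a two-qubit pure state.

    Entry (i, j) of the result is sum_b state[i, b] * conj(state[j, b]).
    """
    return [
        [
            sum(state[2 * i + b] * complex(state[2 * j + b]).conjugate() for b in (0, 1))
            for j in (0, 1)
        ]
        for i in (0, 1)
    ]


# Local density operator of the intercepted qubit for each possible message.
reduced = {label: reduced_state_first_qubit(vec) for label, vec in BELL.items()}
```

All four results equal the maximally mixed state, so a measurement on the intercepted qubit alone gives the same statistics for every message.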



6.2.4 Quantum Teleportation 

Perhaps the most striking protocol in noiseless quantum communication is the quantum 
teleportation protocol. The protocol destroys the quantum state of a qubit in one location 
and recreates it on a qubit at a distant location, with the help of shared entanglement. Thus, 
the name "teleportation" corresponds well to the mechanism that occurs. 

The teleportation protocol is actually a flipped version of the super-dense coding protocol,
in the sense that Alice and Bob merely "swap their equipment." The first step in
understanding teleportation is to perform a few algebraic steps using the tricks of the tensor
product and the Bell state substitutions from Exercise 3.5.13. Consider a qubit $|\psi\rangle^{A'}$ that
Alice possesses, where

$$|\psi\rangle^{A'} \equiv \alpha|0\rangle^{A'} + \beta|1\rangle^{A'}. \tag{6.28}$$

Suppose she shares a maximally entangled state $|\Phi^+\rangle^{AB}$ with Bob. The joint state of the
systems A', A, and B is as follows:

$$|\psi\rangle^{A'} |\Phi^+\rangle^{AB}. \tag{6.29}$$
Figure 6.3: The above figure depicts the teleportation protocol. Alice would like to transmit an arbitrary
quantum state $|\psi\rangle^{A'}$ to Bob. Alice and Bob share an ebit before the protocol begins. Alice can "teleport"
her quantum state to Bob by consuming the entanglement and two uses of a noiseless classical bit channel.



Let us first explicitly write out this state: 



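$$|\psi\rangle^{A'} |\Phi^+\rangle^{AB} = \left(\alpha|0\rangle^{A'} + \beta|1\rangle^{A'}\right) \left(\frac{|00\rangle^{AB} + |11\rangle^{AB}}{\sqrt{2}}\right). \tag{6.30}$$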



Distributing terms gives the following equality: 



$$\frac{1}{\sqrt{2}} \left[ \alpha|000\rangle^{A'AB} + \beta|100\rangle^{A'AB} + \alpha|011\rangle^{A'AB} + \beta|111\rangle^{A'AB} \right]. \tag{6.31}$$



We use the relations in Exercise 3.5.13 to rewrite the joint system A' A in the Bell basis: 



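$$\frac{1}{2} \Big[ \alpha \left( |\Phi^+\rangle^{A'A} + |\Phi^-\rangle^{A'A} \right) |0\rangle^B + \beta \left( |\Psi^+\rangle^{A'A} - |\Psi^-\rangle^{A'A} \right) |0\rangle^B
+ \alpha \left( |\Psi^+\rangle^{A'A} + |\Psi^-\rangle^{A'A} \right) |1\rangle^B + \beta \left( |\Phi^+\rangle^{A'A} - |\Phi^-\rangle^{A'A} \right) |1\rangle^B \Big]. \tag{6.32}$$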



Simplifying gives the following equivalence: 



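$$\frac{1}{2} \Big[ |\Phi^+\rangle^{A'A} \left( \alpha|0\rangle^B + \beta|1\rangle^B \right) + |\Phi^-\rangle^{A'A} \left( \alpha|0\rangle^B - \beta|1\rangle^B \right)
+ |\Psi^+\rangle^{A'A} \left( \alpha|1\rangle^B + \beta|0\rangle^B \right) + |\Psi^-\rangle^{A'A} \left( \alpha|1\rangle^B - \beta|0\rangle^B \right) \Big]. \tag{6.33}$$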



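We can finally rewrite the state as four superposition terms, with a distinct Pauli operator
applied to Bob's system B for each term in the superposition:

$$\frac{1}{2} \left[ |\Phi^+\rangle^{A'A} |\psi\rangle^B + |\Phi^-\rangle^{A'A} Z|\psi\rangle^B + |\Psi^+\rangle^{A'A} X|\psi\rangle^B + |\Psi^-\rangle^{A'A} XZ|\psi\rangle^B \right]. \tag{6.34}$$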
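The rewriting of the joint state from (6.29) into the form (6.34) can be verified numerically for a particular choice of amplitudes. The sketch below uses plain Python; the amplitudes `alpha` and `beta` are arbitrary illustrative values, and the helper `kron` is a hypothetical name for a tensor product of amplitude lists.

```python
import cmath
import math

s = 1 / math.sqrt(2)

# Bell states on A'A in the order Phi+, Phi-, Psi+, Psi-.
BELL = [[s, 0, 0, s], [s, 0, 0, -s], [0, s, s, 0], [0, s, -s, 0]]


def kron(u, v):
    """Tensor product of two amplitude lists (first factor is most significant)."""
    return [a * b for a in u for b in v]


# Arbitrary normalized amplitudes chosen for illustration.
alpha, beta = 0.6, 0.8 * cmath.exp(0.3j)
psi = [alpha, beta]

# Bob's vectors in (6.34): |psi>, Z|psi>, X|psi>, XZ|psi>.
bob = [[alpha, beta], [alpha, -beta], [beta, alpha], [-beta, alpha]]

# Left side: |psi>^{A'} |Phi+>^{AB}, with systems ordered A', A, B.
lhs = kron(psi, BELL[0])

# Right side: one half times the sum of the four terms in (6.34).
rhs = [0j] * 8
for bell, vec in zip(BELL, bob):
    for k, amp in enumerate(kron(bell, vec)):
        rhs[k] += 0.5 * amp
```

The two amplitude lists agree entry by entry, which is the content of the chain of equalities (6.30) through (6.34) for this input.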



We now outline the three steps of the teleportation protocol (Figure 6.3 depicts the 
protocol): 
1. Alice performs a Bell measurement on her systems A'A. The state collapses to one of
the following four states with uniform probability:

$$|\Phi^+\rangle^{A'A} |\psi\rangle^B, \tag{6.35}$$

$$|\Phi^-\rangle^{A'A} Z|\psi\rangle^B, \tag{6.36}$$

$$|\Psi^+\rangle^{A'A} X|\psi\rangle^B, \tag{6.37}$$

$$|\Psi^-\rangle^{A'A} XZ|\psi\rangle^B. \tag{6.38}$$

Notice that the state resulting from the measurement is a product state with respect
to the cut $A'A - B$, regardless of the outcome of the measurement. At this point, Alice
knows whether Bob's state is $|\psi\rangle^B$, $Z|\psi\rangle^B$, $X|\psi\rangle^B$, or $XZ|\psi\rangle^B$ because she knows
the result of the measurement. On the other hand, Bob does not know anything
about the state of his system B: Exercise 4.4.9 proves that his local density operator
is the maximally mixed state $\pi^B$ just after Alice performs the measurement. Thus,
there is no teleportation of quantum information at this point because Bob's state is
completely independent of the original state $|\psi\rangle$. In other words, teleportation cannot
be instantaneous.

2. Alice transmits two classical bits to Bob that indicate which of the four measurement
results she obtains. After Bob receives the classical information, he is immediately
certain which operation he needs to perform in order to restore his state to Alice's
original state $|\psi\rangle$. Notice that he does not need to have knowledge of the state in order
to restore it; he only needs knowledge of the restoration operation.

3. Bob performs the restoration operation: one of the identity, a Pauli X operator, a
Pauli Z operator, or the Pauli operator XZ, depending on the classical information
that he receives from Alice.
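The three steps above can be sketched as a small simulation in plain Python. The assumptions are labeled in the comments: states are amplitude lists with systems ordered A', A, B, the function name `teleport` is hypothetical, and the correction for the $|\Psi^-\rangle$ outcome is taken to be ZX, the exact inverse of XZ (applying XZ instead, as in the text, gives the same state up to a global sign).

```python
import cmath
import math

s = 1 / math.sqrt(2)

BELL = {
    "Phi+": [s, 0, 0, s],
    "Phi-": [s, 0, 0, -s],
    "Psi+": [0, s, s, 0],
    "Psi-": [0, s, -s, 0],
}

I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Z = [[1, 0], [0, -1]]
ZX = [[0, 1], [-1, 0]]   # matrix product of Z and X; undoes XZ exactly

# Bob's restoration operation for each of Alice's measurement outcomes.
CORRECTION = {"Phi+": I2, "Phi-": Z, "Psi+": X, "Psi-": ZX}


def matvec(m, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def teleport(psi):
    """Return {outcome: (probability, Bob's corrected state)} for input qubit psi."""
    joint = [a * b for a in psi for b in BELL["Phi+"]]   # systems A', A, B
    results = {}
    for label, bell in BELL.items():
        # Step 1: project A'A onto the Bell state (its amplitudes are real).
        bob = [sum(bell[k] * joint[2 * k + b] for k in range(4)) for b in (0, 1)]
        prob = sum(abs(x) ** 2 for x in bob)
        bob = [x / math.sqrt(prob) for x in bob]
        # Steps 2 and 3: Bob learns the label and applies the matching correction.
        results[label] = (prob, matvec(CORRECTION[label], bob))
    return results


psi = [0.6, 0.8 * cmath.exp(0.3j)]   # arbitrary illustrative input state
outcomes = teleport(psi)
```

Every outcome occurs with probability 1/4 regardless of the input amplitudes, and Bob's corrected state matches the input in each case, in line with the observation that the protocol is oblivious to the state being teleported.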

Teleportation is an oblivious protocol because Alice and Bob do not require any knowledge 
of the quantum state being teleported in order to perform it. We might also say that this 
feature of teleportation makes it universal — it works independent of the input state. 

You might think that the teleportation protocol violates the no-cloning theorem because 
a "copy" of the state appears on Bob's system. But this violation does not occur at any 
point in the protocol because the Bell measurement destroys the information about the state 
of Alice's original information qubit while recreating it somewhere else. Also, notice that 
the result of the Bell measurement is independent of the particular probability amplitudes 
$\alpha$ and $\beta$ corresponding to the state Alice wishes to teleport.

The teleportation protocol is not an instantaneous teleportation, as portrayed in the 
television episodes of Star Trek. There is no transfer of quantum information instantaneously 
after the Bell measurement because Bob's local description of the B system is the maximally 
mixed state $\pi^B$. It is only after he receives the classical bits to "telecorrect" his state that
the transfer occurs. It must be this way; otherwise, they would be able to communicate
faster than the speed of light, and superluminal communication is not allowed in the theory
of relativity.

Finally, we can phrase the teleportation protocol as a resource inequality: 

$$[qq] + 2[c \to c] \geq [q \to q]. \tag{6.39}$$

Again, we factor only nonlocal resources into the resource count. The above resource in- 
equality is perhaps the most surprising of the three unit protocols we have studied so far. It 
combines two resources, noiseless entanglement and noiseless classical communication, that 
achieve noiseless quantum communication even though they are both individually weaker 
than it. This protocol and super-dense coding are two of the most fundamental protocols in 
quantum communication theory because they sparked the notion that there are clever ways 
of combining resources to generate other resources. 



In Exercise 6.2.3 below, we discuss a variation of teleportation called remote state prepa- 
ration, where Alice possesses a classical description of the state that she wishes to teleport. 
With this knowledge, it is possible to reduce the amount of classical communication necessary 
for teleportation. 

Exercise 6.2.3 Remote state preparation is a variation on the teleportation protocol. We 
consider a simple example of a remote state preparation protocol. Suppose Alice possesses a 
classical description of a state $|\psi\rangle \equiv (|0\rangle + e^{i\phi}|1\rangle)/\sqrt{2}$ (on the equator of the Bloch sphere)
and she shares an ebit $|\Phi^+\rangle^{AB}$ with Bob. Alice would like to prepare this state on Bob's
system. Show that Alice can prepare this state on Bob's system if she measures her system A
in the $\{|\psi^*\rangle, |\psi^{\perp *}\rangle\}$ basis, transmits one classical bit, and Bob performs a recovery operation
conditional on the classical information. (Note that $|\psi^*\rangle$ is the conjugate of the vector $|\psi\rangle$.)

Exercise 6.2.4 Third-party controlled teleportation is another variation on the teleportation 
protocol. Suppose that Alice, Bob, and Charlie possess a GHZ state: 

$$|\Phi_{\text{GHZ}}\rangle^{ABC} \equiv \frac{|000\rangle^{ABC} + |111\rangle^{ABC}}{\sqrt{2}}. \tag{6.40}$$

Alice would like to teleport an arbitrary qubit to Bob. She performs the usual steps in the 
teleportation protocol. Give the final steps that Charlie should perform and the information 
that he should transmit to Bob in order to complete the teleportation protocol. (Hint: The 
resource inequality for the protocol is as follows: 

$$[qqq]_{ABC} + 2[c \to c]_{A \to B} + [c \to c]_{C \to B} \geq [q \to q]_{A \to B}, \tag{6.41}$$

where $[qqq]_{ABC}$ represents the resource of the GHZ state shared between Alice, Bob, and
Charlie, and the other resources are as before with the directionality of communication
indicated by the corresponding subscript.)

Exercise 6.2.5 Gate teleportation is yet another variation of quantum teleportation that is 
useful in fault-tolerant quantum computation. Suppose that Alice would like to perform a 
single-qubit gate $U$ on a qubit in state $|\psi\rangle$. Suppose that the gate $U$ is difficult to perform,
but that $U \sigma_i U^\dagger$, where $\sigma_i$ is one of the single-qubit Pauli operators, is much less difficult
to perform. A protocol for gate teleportation is as follows. Alice and Bob first prepare the
ebit $U^B |\Phi^+\rangle^{AB}$. Alice performs a Bell measurement on her qubit $|\psi\rangle^{A'}$ and system A. She
transmits two classical bits to Bob and Bob performs one of the four corrective operations
$U \sigma_i U^\dagger$ on his qubit. Show that this protocol works, i.e., Bob's final state is $U|\psi\rangle$.

Exercise 6.2.6 Show that it is possible to simulate a dephasing qubit channel by the following
technique. First, Alice prepares a maximally entangled Bell state $|\Phi^+\rangle^{AB}$. She sends half
of it to Bob through a dephasing qubit channel. She and Bob perform the usual teleporta- 
tion protocol. Show that this procedure gives the same result as sending a qubit through a 
dephasing channel. (Hint: This result holds because the dephasing channel commutes with 
all Pauli operators.) 

Exercise 6.2.7 Construct an entanglement swapping protocol from the teleportation protocol.
That is, suppose that Charlie and Alice possess a bipartite state $|\psi\rangle^{CA}$. Show that
if Alice teleports her half of the state $|\psi\rangle^{CA}$ to Bob, then Charlie and Bob share the state
$|\psi\rangle^{CB}$. A special case of this protocol is when the state $|\psi\rangle^{CA}$ is an ebit. Then the protocol
is equivalent to an entanglement swapping protocol.

6.3 Optimality of the Three Unit Protocols 

We now consider several arguments that may seem somewhat trivial at first, but they are 
crucial in a good theory of quantum communication. We are always thinking about the 
optimality of certain protocols — if there is a better, cheaper way to perform a given protocol, 
we would prefer to do it this way so that we do not have to pay expensive bills to the quantum
communication companies of the future.¹ There are several questions that we can ask about
the above protocols:

1. In entanglement distribution, is one ebit per qubit the best that we can do, or is it 
possible to generate more than one ebit with a single use of a noiseless qubit channel? 

2. In super-dense coding, is it possible to generate two noiseless classical bit channels 
with less than one noiseless qubit channel or less than one noiseless ebit? Is it possible 
to generate more than two classical bit channels with the given resources? 

3. In teleportation, is it possible to teleport more than one qubit with the given resources? 
Is it possible to teleport using less than two classical bits or less than one ebit? 

In this section, we answer all these questions in the negative — all the protocols as given 
are optimal protocols. Here, we begin to see the beauty of the resource inequality formalism.
It allows us to chain protocols together to make new protocols. We exploit this idea in the
forthcoming optimality arguments.

1 If you will be working at a future quantum communication company, it also makes sense to find optimal
protocols so that you can squeeze in more customers with your existing physical resources!

First, let us tackle the optimality of entanglement distribution. Is there a protocol that 
implements any other resource inequality such as 

$$[q \to q] \geq E[qq], \tag{6.42}$$

where the rate E of entanglement generation is greater than one? 

We show that such a resource inequality can never occur, i.e., it is optimal for E = 
1. Suppose such a resource inequality with E > 1 does exist. Under an assumption of 
free forward classical communication, we can combine the above resource inequality with 
teleportation to achieve the following resource inequality: 

$$[q \to q] \geq E[q \to q]. \tag{6.43}$$

We could then simply keep repeating this protocol to achieve an unbounded amount of 
quantum communication, which is impossible. Thus, it must be that E = 1. 
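The chaining step in this argument can be written out line by line. The following is a sketch under the same assumption of free forward classical communication; treating a rate $E$ as "$E$ uses" of a unit protocol is an informal simplification made here for illustration.

```latex
\begin{align*}
[q \to q] &\geq E[qq]
  && \text{(hypothesized protocol (6.42))} \\
E[qq] + 2E[c \to c] &\geq E[q \to q]
  && \text{(teleportation (6.39), used $E$ times)} \\
[q \to q] &\geq E[q \to q]
  && \text{(chain the two, with classical communication free)} \\
[q \to q] &\geq E^k[q \to q]
  && \text{(repeat the chained protocol $k$ times)}
\end{align*}
```

For $E > 1$ the factor $E^k$ grows without bound as $k$ increases, which is the contradiction the argument relies on.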

Next, we consider the optimality of super-dense coding. We again exploit a proof by 
contradiction argument. Let us suppose that we have an unlimited amount of entanglement 
available. Suppose that there exists some "super-duper" dense coding protocol that generates 
an amount of classical communication greater than super-dense coding generates. That is, 
the classical communication output of super-duper-dense coding is 2C where C > 1, and its 
resource inequality is 

$$[q \to q] + [qq] \geq 2C[c \to c]. \tag{6.44}$$

Then this super-duper-dense coding scheme (along with the infinite entanglement) gives the
following resource inequality:

$$[q \to q] + \infty[qq] \geq 2C[c \to c] + \infty[qq]. \tag{6.45}$$

An infinite amount of entanglement is still available after executing the super-duper-dense
coding protocol because it consumes only a finite amount of entanglement. We can then
chain the above protocol with teleportation and achieve the following resource inequality:

$$2C[c \to c] + \infty[qq] \geq C[q \to q] + \infty[qq]. \tag{6.46}$$

Overall, we have then shown a scheme that achieves the following resource inequality:

$$[q \to q] + \infty[qq] \geq C[q \to q] + \infty[qq]. \tag{6.47}$$

We can continue with this protocol and perform it k times so that we implement the following
resource inequality:

$$[q \to q] + \infty[qq] \geq C^k[q \to q] + \infty[qq]. \tag{6.48}$$

The result of this construction is that one noiseless qubit channel and an infinite amount 
of entanglement can generate an infinite amount of quantum communication. This result 
is impossible physically because entanglement does not boost the capacity of a noiseless
qubit channel. Also, the scheme is exploiting just one noiseless qubit channel along with
the entanglement to generate an unbounded amount of quantum communication, so it must
be signaling superluminally in order to do so. Thus, the rate of classical communication in
super-dense coding is optimal.

We leave the optimality arguments for teleportation as an exercise because they are sim- 
ilar to those for the super-dense coding protocol. Note that it is possible to prove optimality 
of these protocols without assumptions such as free classical communication (for the case of 
entanglement distribution), and we do so in Chapter 8.



Exercise 6.3.1 Show that it is impossible for C > 1 in the teleportation protocol where C 
is with respect to the following resource inequality: 

$$2[c \to c] + [qq] \geq C[q \to q]. \tag{6.49}$$

Exercise 6.3.2 Show that the rates of the consumed resources in the teleportation and 
super-dense coding protocols are optimal. 

6.4 Extensions for Quantum Shannon Theory 

The previous section sparked some good questions that we might ask as a quantum Shannon 
theorist. We might also wonder what types of communication rates are possible if some of 
the consumed resources are noisy, rather than being perfect resources. We list some of these 
questions below. 

Let us first consider entanglement distribution. Suppose that the consumed noiseless 
qubit channel in entanglement distribution is instead a noisy quantum channel $\mathcal{N}$, where $\mathcal{N}$
is some CPTP map. The communication task is then known as entanglement generation.
We can rephrase the communication task as the following resource inequality:

$$\langle \mathcal{N} \rangle \geq E[qq]. \tag{6.50}$$

The meaning of the resource inequality is that we consume the resource of a noisy quantum
channel $\mathcal{N}$ in order to generate entanglement between a sender and receiver at some rate E.
We will make the definition of a quantum Shannon-theoretic resource inequality more precise
when we begin our formal study of quantum Shannon theory, but the above definition should
be sufficient for now. The optimal rate of entanglement generation with the noisy quantum
channel $\mathcal{N}$ is known as the entanglement generation capacity of $\mathcal{N}$. This task is intimately
related to the quantum communication capacity of $\mathcal{N}$, and we discuss the connection further
in Chapter 23.



Let us now turn to super-dense coding. Suppose that the consumed noiseless qubit 
channel in super-dense coding is instead a noisy quantum channel $\mathcal{N}$. The name for this
task is then entanglement-assisted classical communication. The following resource inequality
captures the corresponding communication task:

$$\langle \mathcal{N} \rangle + E[qq] \geq C[c \to c]. \tag{6.51}$$

The meaning of the resource inequality is that we consume a noisy quantum channel $\mathcal{N}$
and noiseless entanglement at some rate E to produce noiseless classical communication at
some rate C. We will study this protocol in depth in Chapter 20. We can also consider the
scenario where the entanglement is no longer noiseless, but it is rather a general bipartite
state $\rho^{AB}$ that Alice and Bob share. The task is then known as noisy super-dense coding.²
We study noisy super-dense coding in Chapter 21. The corresponding resource inequality is
as follows (its meaning should be clear at this point):

$$\langle \rho^{AB} \rangle + Q[q \to q] \geq C[c \to c]. \tag{6.52}$$

We can ask the same questions for the teleportation protocol as well. Suppose that 
the entanglement resource is instead a noisy bipartite state $\rho^{AB}$. The task is then noisy
teleportation and has the following resource inequality:

$$\langle \rho^{AB} \rangle + C[c \to c] \geq Q[q \to q]. \tag{6.53}$$

The questions presented in this section are some of the fundamental questions in quantum 
Shannon theory. We arrived at these questions simply by replacing the noiseless resources 
in the three fundamental noiseless protocols with noisy ones. We will spend a significant 
amount of effort building up our knowledge of quantum Shannon-theoretic tools that will be 
indispensable for answering these questions. 



6.5 Three Unit Qudit Protocols 



We end this chapter by studying the qudit versions of the three unit protocols. It is useful 
to have these versions of the protocols because we may want to process qudit systems with 
them. 

The qudit resources are straightforward extensions of the qubit resources. A noiseless 
qudit channel is the following map: 

$$|i\rangle^A \to |i\rangle^B, \tag{6.54}$$

where $\{|i\rangle^A\}_{i \in \{0, \ldots, d-1\}}$ is some preferred orthonormal basis on Alice's system and $\{|i\rangle^B\}_{i \in \{0, \ldots, d-1\}}$
is some preferred basis on Bob's system. We can also write the qudit channel map as the
following isometry:

$$I^{A \to B} \equiv \sum_{i=0}^{d-1} |i\rangle^B \langle i|^A. \tag{6.55}$$

The map $I^{A \to B}$ preserves superposition states so that

$$\sum_{i=0}^{d-1} \alpha_i |i\rangle^A \to \sum_{i=0}^{d-1} \alpha_i |i\rangle^B. \tag{6.56}$$



2 The name noisy super-dense coding could just as well apply to the former task of entanglement-assisted
classical communication, but this terminology has "stuck" in the research literature for this specific quantum
information processing task.

A noiseless classical dit channel or cdit is the following map:

$$|i\rangle\langle i|^{A} \to |i\rangle\langle i|^{B}, \tag{6.57}$$

$$|i\rangle\langle j|^{A} \to 0 \quad \text{for } i \neq j. \tag{6.58}$$

A noiseless maximally entangled qudit state or an edit is as follows:

$$|\Phi\rangle^{AB} \equiv \frac{1}{\sqrt{d}} \sum_{i=0}^{d-1} |i\rangle^{A} |i\rangle^{B}. \tag{6.59}$$

We quantify the "dit" resources with bit measures. For example, a noiseless qudit channel 
is the following resource: 

$$\log d\,[q \to q], \tag{6.60}$$

where the logarithm is base two. Thus, one qudit channel can transmit $\log d$ qubits of
quantum information so that the qubit remains our standard unit of quantum information. 
We quantify the amount of information transmitted according to the dimension of the space 
that is transmitted. For example, suppose that a quantum system has eight levels. We can 
then encode three qubits of quantum information in this eight-level system. 
Likewise, a classical dit channel is the following resource: 

log d[c->c], (6.61) 

so that a classical dit channel transmits log d classical bits. The parameter d here is the 
number of classical messages that the channel transmits. 
Finally, an edit is the following resource: 

$$\log d\,[qq]. \tag{6.62}$$

We quantify the amount of entanglement in a maximally entangled state by its Schmidt rank
(see Theorem 3.6.1). We measure entanglement in units of ebits (we return to this issue in
Chapter 18).
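The bit-based accounting above is a one-line computation. The following is a minimal sketch; the function name `dit_resource_in_bits` is hypothetical and chosen only for illustration.

```python
import math

def dit_resource_in_bits(d: int) -> float:
    """Return log2(d): the number of qubits, cbits, or ebits that one
    qudit channel, cdit channel, or edit of dimension d is worth."""
    if d < 2:
        raise ValueError("dimension d must be at least 2")
    return math.log2(d)

# The eight-level example from the text: one such system carries three qubits.
print(dit_resource_in_bits(8))
```

For dimensions that are not powers of two the result is fractional, which is why the resource inequalities carry the prefactor $\log d$ rather than an integer count.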



6.5.1 Entanglement Distribution 



The extension of the entanglement distribution protocol to the qudit case is straightforward.
Alice merely prepares the state $|\Phi\rangle^{AA'}$ in her laboratory and transmits the system $A'$ through
a noiseless qudit channel. She can prepare the state $|\Phi\rangle^{AA'}$ with two gates: the qudit analog
of the Hadamard gate and the CNOT gate. The qudit analog of the Hadamard gate is the
Fourier gate $F$ introduced in Exercise 3.6.8, where

$$F : |l\rangle \to \frac{1}{\sqrt{d}} \sum_{j=0}^{d-1} \exp\left\{\frac{2\pi i l j}{d}\right\} |j\rangle, \tag{6.63}$$

so that

$$F \equiv \frac{1}{\sqrt{d}} \sum_{l,j=0}^{d-1} \exp\left\{\frac{2\pi i l j}{d}\right\} |j\rangle\langle l|. \tag{6.64}$$

The qudit analog of the CNOT gate is the following controlled-shift gate:

$$\text{CNOT}_d \equiv \sum_{j=0}^{d-1} |j\rangle\langle j| \otimes X(j), \tag{6.65}$$

where $X(j)$ is defined in (3.209).

Exercise 6.5.1 Verify that Alice can prepare the maximally entangled qudit state $|\Phi\rangle^{AA'}$
locally by preparing $|0\rangle^{A}|0\rangle^{A'}$, applying $F^{A}$ and $\text{CNOT}_d$. Show that

$$|\Phi\rangle^{AA'} = \text{CNOT}_d \cdot F^{A} |0\rangle^{A} |0\rangle^{A'}. \tag{6.66}$$

The resource inequality for this qudit entanglement distribution protocol is as follows:

$$\log d\,[q \to q] \geq \log d\,[qq]. \tag{6.67}$$
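The preparation step in (6.66) can be checked numerically. The sketch below assumes the convention $X(x)|j\rangle = |j + x \bmod d\rangle$ for the shift operator; the helper names are hypothetical and used only for illustration.

```python
import numpy as np

def shift(d, x):
    """Cyclic shift X(x)|j> = |j + x mod d> (assumed convention)."""
    return np.roll(np.eye(d), x, axis=0)

def fourier(d):
    """Fourier gate with entries exp(2*pi*i*l*j/d)/sqrt(d), as in (6.64)."""
    idx = np.arange(d)
    return np.exp(2j * np.pi * np.outer(idx, idx) / d) / np.sqrt(d)

def cnot_d(d):
    """Controlled-shift gate sum_j |j><j| (x) X(j), as in (6.65)."""
    basis = np.eye(d)
    return sum(np.kron(np.outer(basis[j], basis[j]), shift(d, j)) for j in range(d))

def prepare_edit(d):
    """Apply F on system A, then CNOT_d, to the state |0>|0>."""
    zero = np.eye(d)[0]
    start = np.kron(zero, zero)
    return cnot_d(d) @ np.kron(fourier(d), np.eye(d)) @ start

def max_entangled(d):
    """(1/sqrt(d)) sum_i |i>|i> as a vector of length d*d."""
    return np.eye(d).reshape(d * d) / np.sqrt(d)

# Compare the prepared state with the maximally entangled state for d = 5.
print(np.allclose(prepare_edit(5), max_entangled(5)))
```

The Fourier gate turns $|0\rangle$ into a uniform superposition, and the controlled shift then copies the basis label onto the second system.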

6.5.2 Quantum Super-Dense Coding 

The qudit version of the super-dense coding protocol proceeds analogously to the qubit case, 
with some notable exceptions. It still consists of three steps: 

1. Alice and Bob begin with a maximally entangled qudit state of the form:

$$|\Phi\rangle^{AB} = \frac{1}{\sqrt{d}} \sum_{i=0}^{d-1} |i\rangle^{A} |i\rangle^{B}. \tag{6.68}$$

Alice applies one of $d^2$ unitary operations in the set $\{X(x)Z(z)\}_{x,z=0}^{d-1}$ to her qudit. The
shared state then becomes one of the $d^2$ maximally entangled qudit states in (3.240).

2. She sends her qudit to Bob with one use of a noiseless qudit channel.

3. Bob performs a measurement in the qudit Bell basis to determine the message Alice
sent. The result of Exercise 3.6.11 is that these states are perfectly distinguishable
with a measurement.

This qudit super-dense coding protocol implements the following resource inequality:

$$\log d\,[qq] + \log d\,[q \to q] \geq 2 \log d\,[c \to c]. \tag{6.69}$$
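The three steps above can be simulated directly. This is a sketch under assumed conventions, $X(x)|j\rangle = |j + x \bmod d\rangle$ and $Z(z)|j\rangle = e^{2\pi i z j/d}|j\rangle$; the function names are hypothetical. It checks that the $d^2$ encoded states are orthonormal, which is what allows Bob's Bell-basis measurement to recover the message.

```python
import numpy as np

def shift(d, x):
    """X(x)|j> = |j + x mod d> (assumed convention)."""
    return np.roll(np.eye(d), x, axis=0)

def clock(d, z):
    """Z(z)|j> = exp(2*pi*i*z*j/d)|j> (assumed convention)."""
    return np.diag(np.exp(2j * np.pi * z * np.arange(d) / d))

def encoded_state(d, x, z):
    """Shared state after Alice applies X(x)Z(z) to her half of the edit."""
    phi = np.eye(d).reshape(d * d) / np.sqrt(d)
    return np.kron(shift(d, x) @ clock(d, z), np.eye(d)) @ phi

def decode(d, received):
    """Bob's Bell-basis measurement: return the (x, z) with the largest overlap."""
    overlaps = {(x, z): abs(np.vdot(encoded_state(d, x, z), received)) ** 2
                for x in range(d) for z in range(d)}
    return max(overlaps, key=overlaps.get)

d = 3
states = np.array([encoded_state(d, x, z) for x in range(d) for z in range(d)])
gram = states.conj() @ states.T          # matrix of pairwise inner products
print(np.allclose(gram, np.eye(d * d)))  # orthonormal set of d^2 states
print(decode(d, encoded_state(d, 2, 1)))
```

Because the Gram matrix is the identity, each of the $d^2$ messages is identified with certainty, giving $2 \log d$ classical bits per qudit sent.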
6.5.3 Quantum Teleportation

The operations in the qudit teleportation protocol are again similar to the qubit case. The 
protocol proceeds in three steps: 

1. Alice possesses an arbitrary qudit $|\psi\rangle^{A'}$ where

$$|\psi\rangle^{A'} \equiv \sum_{i=0}^{d-1} \alpha_i |i\rangle^{A'}. \tag{6.70}$$

Alice and Bob share a maximally entangled qudit state $|\Phi\rangle^{AB}$ of the form

$$|\Phi\rangle^{AB} = \frac{1}{\sqrt{d}} \sum_{j=0}^{d-1} |j\rangle^{A} |j\rangle^{B}. \tag{6.71}$$

The joint state of Alice and Bob is then $|\psi\rangle^{A'} |\Phi\rangle^{AB}$. Alice performs a measurement in
the basis $\{|\Phi_{i,j}\rangle^{A'A}\}_{i,j}$.

2. She transmits the measurement result $i, j$ to Bob with the use of two classical dit
channels.

3. Bob then applies the unitary transformation $Z^{B}(j) X^{B}(i)$ to his state to "telecorrect"
it to Alice's original qudit.

We prove that this protocol works by analyzing the probability of the measurement result 
and the post-measurement state on Bob's system. The techniques that we employ here are 
different from those for the qubit case. 

First, let us suppose that Alice would like to teleport the $A'$ system of a state $|\psi\rangle^{RA'}$ that
she shares with an inaccessible reference system $R$. This way, our teleportation protocol
encompasses the most general setting in which Alice would like to teleport a mixed state
on $A'$. Also, Alice shares the maximally entangled edit state $|\Phi\rangle^{AB}$ with Bob. Alice first
performs a measurement of the systems $A'$ and $A$ in the basis $\{|\Phi_{i,j}\rangle^{A'A}\}_{i,j}$ where

$$|\Phi_{i,j}\rangle^{A'A} = U_{ij}^{A'} |\Phi\rangle^{A'A}, \tag{6.72}$$

and

$$U_{ij}^{A'} \equiv Z^{A'}(j) X^{A'}(i). \tag{6.73}$$

The measurement operators are thus

$$|\Phi_{i,j}\rangle\langle\Phi_{i,j}|^{A'A}. \tag{6.74}$$

Then the unnormalized post-measurement state is

$$|\Phi_{i,j}\rangle\langle\Phi_{i,j}|^{A'A} \, |\psi\rangle^{RA'} |\Phi\rangle^{AB}. \tag{6.75}$$

We can rewrite this state as follows, by exploiting the definition of $|\Phi_{i,j}\rangle^{A'A}$ in (6.72):

$$|\Phi_{i,j}\rangle\langle\Phi|^{A'A} \, (U_{ij}^{\dagger})^{A'} \, |\psi\rangle^{RA'} |\Phi\rangle^{AB}. \tag{6.76}$$

Recall the "Bell-state matrix identity" in Exercise 3.6.12 that holds for any maximally
entangled state $|\Phi\rangle$. We can exploit this result to show that the action of the unitary $U_{ij}^{\dagger}$ on
the $A'$ system is the same as the action of the unitary $U_{ij}^{*}$ on the $A$ system:

$$|\Phi_{i,j}\rangle\langle\Phi|^{A'A} \, (U_{ij}^{*})^{A} \, |\psi\rangle^{RA'} |\Phi\rangle^{AB}. \tag{6.77}$$

Then the unitary $(U_{ij}^{*})^{A}$ commutes with the systems $R$ and $A'$:

$$|\Phi_{i,j}\rangle\langle\Phi|^{A'A} \, |\psi\rangle^{RA'} \, (U_{ij}^{*})^{A} |\Phi\rangle^{AB}. \tag{6.78}$$

We can again apply the Bell-state matrix identity in Exercise 3.6.12 to show that the state
is equal to

$$|\Phi_{i,j}\rangle\langle\Phi|^{A'A} \, |\psi\rangle^{RA'} \, (U_{ij}^{\dagger})^{B} |\Phi\rangle^{AB}. \tag{6.79}$$

Then we can commute the unitary $(U_{ij}^{\dagger})^{B}$ all the way to the left, and we can switch the
order of $|\psi\rangle^{RA'}$ and $|\Phi\rangle^{AB}$ without any problem because the system labels are sufficient to
track the states in these systems:

$$(U_{ij}^{\dagger})^{B} \, |\Phi_{i,j}\rangle\langle\Phi|^{A'A} \, |\Phi\rangle^{AB} |\psi\rangle^{RA'}. \tag{6.80}$$

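Now let us consider the very special overlap $\langle\Phi|^{A'A} |\Phi\rangle^{AB}$ of the maximally entangled
edit state with itself on different systems:

$$\langle\Phi|^{A'A} |\Phi\rangle^{AB} = \left( \frac{1}{\sqrt{d}} \sum_{i=0}^{d-1} \langle i|^{A'} \langle i|^{A} \right) \left( \frac{1}{\sqrt{d}} \sum_{j=0}^{d-1} |j\rangle^{A} |j\rangle^{B} \right) \tag{6.81}$$

$$= \frac{1}{d} \sum_{i,j=0}^{d-1} \langle i|^{A'} \langle i|^{A} \, |j\rangle^{A} |j\rangle^{B} \tag{6.82}$$

$$= \frac{1}{d} \sum_{i,j=0}^{d-1} \langle i|^{A'} \, \langle i|j\rangle^{A} \, |j\rangle^{B} \tag{6.83}$$

$$= \frac{1}{d} \sum_{i=0}^{d-1} \langle i|^{A'} |i\rangle^{B} \tag{6.84}$$

$$= \frac{1}{d} \sum_{i=0}^{d-1} |i\rangle^{B} \langle i|^{A'} \tag{6.85}$$

$$= \frac{1}{d} \, I^{A' \to B}. \tag{6.86}$$

The first equality follows by definition. The second equality follows from linearity and
rearranging terms in the multiplication and summation. The third and fourth equalities
follow by realizing that $\langle i|^{A} |j\rangle^{A}$ is an inner product and evaluating it for the orthonormal
basis $\{|i\rangle^{A}\}$. The fifth equality follows by rearranging the bra and the ket. The final equality
is our last important realization: the operator $\sum_{i=0}^{d-1} |i\rangle^{B} \langle i|^{A'}$ is the noiseless qudit channel
$I^{A' \to B}$ that the teleportation protocol creates from the system $A'$ to $B$ (see the definition of
a noiseless qudit channel in (6.55)). We might refer to this as the "teleportation isometry."

We now apply the teleportation isometry to the state in (6.80):

$$(U_{ij}^{\dagger})^{B} \, |\Phi_{i,j}\rangle\langle\Phi|^{A'A} \, |\Phi\rangle^{AB} |\psi\rangle^{RA'} = (U_{ij}^{\dagger})^{B} \, |\Phi_{i,j}\rangle^{A'A} \, \frac{1}{d} I^{A' \to B} |\psi\rangle^{RA'} \tag{6.87}$$

$$= \frac{1}{d} (U_{ij}^{\dagger})^{B} \, |\Phi_{i,j}\rangle^{A'A} |\psi\rangle^{RB} \tag{6.88}$$

$$= \frac{1}{d} |\Phi_{i,j}\rangle^{A'A} \, (U_{ij}^{\dagger})^{B} |\psi\rangle^{RB}. \tag{6.89}$$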


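We can compute the probability of receiving outcome $i$ and $j$ from the measurement when
the input state is $|\psi\rangle^{RA'}$. It is just equal to the overlap of the above vector with itself:

$$p(i,j|\psi) = \left[ \frac{1}{d} \langle\Phi_{i,j}|^{A'A} \langle\psi|^{RB} (U_{ij})^{B} \right] \left[ \frac{1}{d} |\Phi_{i,j}\rangle^{A'A} (U_{ij}^{\dagger})^{B} |\psi\rangle^{RB} \right] \tag{6.90}$$

$$= \frac{1}{d^2} \langle\Phi_{i,j}|\Phi_{i,j}\rangle^{A'A} \, \langle\psi|^{RB} (U_{ij})^{B} (U_{ij}^{\dagger})^{B} |\psi\rangle^{RB} \tag{6.91}$$

$$= \frac{1}{d^2} \langle\Phi_{i,j}|\Phi_{i,j}\rangle^{A'A} \, \langle\psi|\psi\rangle^{RB} \tag{6.92}$$

$$= \frac{1}{d^2}. \tag{6.93}$$

Thus, the outcome $(i,j)$ is uniformly distributed and independent of the
input state. We would expect this to be the case for a universal teleportation protocol that
operates independently of the input state. Thus, after normalization, the state on Alice and
Bob's system is

$$|\Phi_{i,j}\rangle^{A'A} \, (U_{ij}^{\dagger})^{B} |\psi\rangle^{RB}. \tag{6.94}$$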



At this point, Bob does not know the result of the measurement. We obtain his density
operator by tracing over the systems $A'$, $A$, and $R$ to which he does not have access and
taking the expectation over all the measurement outcomes:

$$\text{Tr}_{A'AR} \left\{ \frac{1}{d^2} \sum_{i,j=0}^{d-1} |\Phi_{i,j}\rangle\langle\Phi_{i,j}|^{A'A} \otimes (U_{ij}^{\dagger})^{B} |\psi\rangle\langle\psi|^{RB} (U_{ij})^{B} \right\} = \frac{1}{d^2} \sum_{i,j=0}^{d-1} (U_{ij}^{\dagger})^{B} \, \psi^{B} \, (U_{ij})^{B} \tag{6.95}$$

$$= \pi^{B}. \tag{6.96}$$

The first equality follows by evaluating the partial trace and by defining

$$\psi^{B} \equiv \text{Tr}_{R} \left\{ |\psi\rangle\langle\psi|^{RB} \right\}. \tag{6.97}$$

The second equality follows because applying a Heisenberg-Weyl operator uniformly at
random completely randomizes a quantum state to be the maximally mixed state (see
Exercise 4.4.9).

Now suppose that Alice sends the measurement results $i$ and $j$ over two uses of a noiseless
classical dit channel. Bob then knows that the state is

$$(U_{ij}^{\dagger})^{B} |\psi\rangle^{RB}, \tag{6.98}$$

and he can apply $(U_{ij})^{B}$ to make the overall state become $|\psi\rangle^{RB}$. This final step completes
the teleportation process. The resource inequality for the qudit teleportation protocol is as
follows:

$$\log d\,[qq] + 2 \log d\,[c \to c] \geq \log d\,[q \to q]. \tag{6.99}$$
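The analysis above can be reproduced numerically for a pure input state on $A'$ (no reference system). The sketch assumes the conventions $X(x)|j\rangle = |j + x \bmod d\rangle$ and $Z(z)|j\rangle = e^{2\pi i z j/d}|j\rangle$; the function names are hypothetical. For every outcome $(i,j)$ it returns the outcome probability and Bob's state after he applies $U_{ij} = Z(j)X(i)$.

```python
import numpy as np

def shift(d, x):
    """X(x)|j> = |j + x mod d> (assumed convention)."""
    return np.roll(np.eye(d), x, axis=0)

def clock(d, z):
    """Z(z)|j> = exp(2*pi*i*z*j/d)|j> (assumed convention)."""
    return np.diag(np.exp(2j * np.pi * z * np.arange(d) / d))

def teleport(psi, i, j):
    """Project A'A onto |Phi_ij>, then let Bob apply U_ij = Z(j)X(i)."""
    d = len(psi)
    phi = np.eye(d).reshape(d * d) / np.sqrt(d)   # edit on two systems
    u = clock(d, j) @ shift(d, i)                 # U_ij as in (6.73)
    bell = np.kron(u, np.eye(d)) @ phi            # |Phi_ij> on A'A, as in (6.72)
    joint = np.kron(psi, phi).reshape(d * d, d)   # rows: (A', A); columns: B
    bob = bell.conj() @ joint                     # unnormalized state on B
    prob = np.vdot(bob, bob).real                 # outcome probability
    corrected = u @ bob / np.sqrt(prob)           # Bob's correction
    return prob, corrected

psi = np.array([0.6, 0.0, 0.8j])                  # example qutrit state
for i in range(3):
    for j in range(3):
        prob, out = teleport(psi, i, j)
        print(i, j, round(prob, 6), np.allclose(out, psi))
```

Each of the nine outcomes should occur with probability $1/d^2 = 1/9$, in agreement with (6.93), and Bob's corrected state should coincide with the input.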

Exercise 6.5.2 Show that 

<$ + | A ' A ((4) A >)<V^ (6-100) 

Exercise 6.5.3 Show that 

((^o B i^)^i B ^)^i $+ )( $+ i A 'i $+ >< $ i AB i $+ )( $+ i A ' A ( c/ i-) A ' 

= ^((4)>><v>i B ^) ®^> + )(^ + i A ' A (4) A '- ( 6 - 101 ) 

6.6 History and Further Reading 

This chapter presented the three important protocols that exploit the three unit resources of 
classical communication, quantum communication, and entanglement. We learned, perhaps 
surprisingly, that it is possible to combine two resources together in interesting ways to simu- 
late a different resource (in both super-dense coding and teleportation). These combinations 
of resources turn up quite a bit in quantum Shannon theory, and we see them in their most 
basic form in this chapter. 

Bennett and Wiesner published the super-dense coding protocol in 1992 [35], and within
a year, Bennett et al. realized that Alice and Bob could teleport particles if they swap their
operations with respect to the super-dense coding protocol [23]. These two protocols were
the seeds of much later work in quantum Shannon theory.
CHAPTER 7



Coherent Protocols



We introduced three protocols in the previous chapter: entanglement distribution, teleporta- 
tion, and super-dense coding. The last two of these protocols, teleportation and super-dense 
coding, are perhaps more interesting than entanglement distribution because they demon- 
strate insightful ways that we can combine all three unit resources to achieve an information 
processing task. 

It appears that teleportation and super-dense coding might be "inverse" protocols with 
respect to each other because teleportation arises from super-dense coding when Alice and 
Bob "swap" their equipment. But there is a fundamental asymmetry between these protocols 
when we consider their respective resource inequalities. Recall that the resource inequality 
for teleportation is 

$$2[c \to c] + [qq] \geq [q \to q], \tag{7.1}$$

while that for super-dense coding is

$$[q \to q] + [qq] \geq 2[c \to c]. \tag{7.2}$$

The asymmetry in these protocols is that they are not dual under resource reversal. Two
protocols are dual under resource reversal if the resources that one consumes are the same
that the other generates and vice versa. Consider that the super-dense coding resource
inequality in (7.2) generates two classical bit channels. Glancing at the left-hand side of the
teleportation resource inequality in (7.1), we see that two classical bit channels generated
from super-dense coding are not sufficient to generate the noiseless qubit channel on the
right-hand side of (7.1): the protocol requires the consumption of noiseless entanglement in
addition to the consumption of the two noiseless classical bit channels.

Is there a way for teleportation and super-dense coding to become dual under resource 
reversal? One way is if we assume that entanglement is a free resource. This assumption is 
strong and we may have difficulty justifying it from a practical standpoint because noiseless 
entanglement is extremely fragile. It is also a powerful resource, as the teleportation and 
super-dense coding protocols demonstrate. But in the theory of quantum communication, 
we often make assumptions such as this one because they tend to simplify a problem
dramatically. Continuing with our development, let us assume that entanglement
is a free resource and that we do not have to factor it into the resource count. Under this
assumption, the resource inequality for teleportation becomes

$$2[c \to c] \geq [q \to q], \tag{7.3}$$

and that for super-dense coding becomes

$$[q \to q] \geq 2[c \to c]. \tag{7.4}$$

Teleportation and super-dense coding are then dual under resource reversal under the
"free-entanglement" assumption, and we obtain the following resource equality:

$$[q \to q] = 2[c \to c]. \tag{7.5}$$

Exercise 7.0.1 Suppose that the quantum capacity of a quantum channel assisted by an 
unlimited amount of entanglement is equal to some number Q. What is the capacity of that 
entanglement-assisted channel for transmitting classical information? 

Exercise 7.0.2 How can we obtain the following resource equality? (Hint: Assume that
some resource is free.)

$$[q \to q] = [qq]. \tag{7.6}$$

Which noiseless protocols did you use to show the above resource equality? The above 
resource equality is a powerful statement: entanglement and quantum communication are 
equivalent under the assumption that you have found. 

Exercise 7.0.3 Suppose that the entanglement generation capacity of a quantum channel 
is equal to some number E. What is the quantum capacity of that channel when assisted by 
free, forward classical communication? 

The above assumptions are useful for finding simple ways to make protocols dual under 
resource reversal, and we will exploit them later in our proofs of various capacity theorems 
in quantum Shannon theory. But it turns out that there is a more clever way to make 
teleportation and super-dense coding dual under resource reversal. In this chapter, we intro- 
duce a new resource — the noiseless coherent bit channel. This resource produces "coherent" 
versions of the teleportation and super-dense coding protocols that are dual under resource 
reversal. The payoff of this coherent communication technique is that we can exploit it to 
simplify the proofs of various coding theorems of quantum Shannon theory. It also leads 
to a deeper understanding of the relationship between the teleportation and dense coding 
protocols from the previous chapter. 

7.1 Definition of Coherent Communication 

We begin by introducing the coherent bit channel as a classical channel that has quantum
feedback. Recall from Exercise 6.1.1 that a classical bit channel is equivalent to a dephasing
channel that dephases in the computational basis with dephasing parameter $p = 1/2$. The
CPTP map corresponding to the dephasing channel is as follows:

$$\mathcal{N}(\rho) = \frac{1}{2} (\rho + Z \rho Z). \tag{7.7}$$

©2012 Mark M. Wilde. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0
Unported License.



The isometric extension $U_{\mathcal{N}}$ of the above channel then follows by applying (5.25):

$$U_{\mathcal{N}} = \frac{1}{\sqrt{2}} \left( I^{A \to B} \otimes |+\rangle^{E} + Z^{A \to B} \otimes |-\rangle^{E} \right), \tag{7.8}$$

where we choose the orthonormal basis states of the environment $E$ to be $|+\rangle$ and $|-\rangle$ (recall
that we have unitary freedom in the choice of the basis states for the environment). It is
straightforward to show that the isometry $U_{\mathcal{N}}$ is as follows by expanding the operators $I$
and $Z$ and the states $|+\rangle$ and $|-\rangle$:

$$U_{\mathcal{N}} = |0\rangle^{B} \langle 0|^{A} \otimes |0\rangle^{E} + |1\rangle^{B} \langle 1|^{A} \otimes |1\rangle^{E}. \tag{7.9}$$

Thus, a classical bit channel is equivalent to the following linear map:

$$|i\rangle^{A} \to |i\rangle^{B} |i\rangle^{E} : i \in \{0,1\}. \tag{7.10}$$
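The equivalence of (7.8) and (7.9) can be confirmed with a few lines of linear algebra. The sketch below represents the isometry as a $4 \times 2$ matrix with output ordering (B, E); the variable and function names are hypothetical.

```python
import numpy as np

ket0, ket1 = np.eye(2)
plus, minus = (ket0 + ket1) / np.sqrt(2), (ket0 - ket1) / np.sqrt(2)
I2, Z = np.eye(2), np.diag([1.0, -1.0])

def column(v):
    """Turn a vector into a column so that kron builds a map A -> B (x) E."""
    return v.reshape(-1, 1)

# Isometry written as in (7.8) and as in (7.9).
u_78 = (np.kron(I2, column(plus)) + np.kron(Z, column(minus))) / np.sqrt(2)
u_79 = (np.kron(np.outer(ket0, ket0), column(ket0))
        + np.kron(np.outer(ket1, ket1), column(ket1)))

def channel_from_isometry(u, rho):
    """Apply the isometry and trace out the environment E."""
    out = (u @ rho @ u.conj().T).reshape(2, 2, 2, 2)  # indices (B, E, B', E')
    return np.einsum('aebe->ab', out)

rho = np.array([[0.7, 0.2 - 0.1j], [0.2 + 0.1j, 0.3]])
print(np.allclose(u_78, u_79))
print(np.allclose(channel_from_isometry(u_79, rho), (rho + Z @ rho @ Z) / 2))
```

Tracing out the environment removes the off-diagonal entries of the input, which is the completely dephasing channel in (7.7).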

A coherent bit channel is similar to the above classical bit channel map, with the exception
that Alice regains control of the environment of the channel:

$$|i\rangle^{A} \to |i\rangle^{B} |i\rangle^{A} : i \in \{0,1\}. \tag{7.11}$$

In this sense, the coherent channel is a quantum feedback channel. "Coherence" in this
context is also synonymous with linearity: the maintenance and linear transformation of
superposed states. The coherent bit channel is similar to classical copying because it copies
the basis states while maintaining coherent superpositions. We denote the resource of a
coherent bit channel as follows:

$$[q \to qq]. \tag{7.12}$$



Figure 7.1 provides a visual depiction of the coherent bit channel.

Exercise 7.1.1 Show that the following resource inequality holds:

$$[q \to qq] \geq [c \to c]. \tag{7.13}$$

That is, devise a protocol that generates a noiseless classical bit channel with one use of a
noiseless coherent bit channel.

Figure 7.1: The above figure depicts the operation of a coherent bit channel. It is the "coherification" of a
classical bit channel in which the sender A has access to the environment's output. For this reason, we also
refer to it as the quantum feedback channel.

Figure 7.2: A simple protocol to implement a noiseless coherent channel with one use of a noiseless qubit
channel.

7.2 Implementations of a Coherent Bit Channel

How might we actually implement a coherent bit channel? The simplest way to do so is with
the aid of a local CNOT gate and a noiseless qubit channel. The protocol proceeds as follows
(Figure 7.2 illustrates the protocol):

1. Alice possesses an information qubit in the state $|\psi\rangle^{A} \equiv \alpha|0\rangle^{A} + \beta|1\rangle^{A}$. She prepares
an ancilla qubit in the state $|0\rangle^{A'}$.

2. Alice performs a local CNOT gate from qubit $A$ to qubit $A'$. The resulting state is

$$\alpha|0\rangle^{A}|0\rangle^{A'} + \beta|1\rangle^{A}|1\rangle^{A'}. \tag{7.14}$$

3. Alice transmits qubit $A'$ to Bob with one use of a noiseless qubit channel $I^{A' \to B}$. The
resulting state is

$$\alpha|0\rangle^{A}|0\rangle^{B} + \beta|1\rangle^{A}|1\rangle^{B}, \tag{7.15}$$

and it is now clear that Alice and Bob have implemented a noiseless coherent bit
channel as defined in (7.11).

The above protocol implements the following resource inequality:

$$[q \to q] \geq [q \to qq], \tag{7.16}$$

demonstrating that quantum communication generates coherent communication.
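The three steps above amount to a single CNOT followed by a relabeling of $A'$ as $B$. A minimal sketch, with a hypothetical function name, is:

```python
import numpy as np

# CNOT with the first qubit as control and the second as target.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

def coherent_bit_channel(psi):
    """Map alpha|0> + beta|1> on A to alpha|00> + beta|11> on (A, B)."""
    ancilla = np.array([1, 0], dtype=complex)   # |0> on A'
    state = CNOT @ np.kron(psi, ancilla)        # step 2: CNOT from A to A'
    # Step 3: the noiseless qubit channel only relabels A' as B,
    # so the amplitude vector is unchanged.
    return state

alpha, beta = 0.6, 0.8j
print(coherent_bit_channel(np.array([alpha, beta])))
```

The output vector has amplitude $\alpha$ on $|00\rangle$ and $\beta$ on $|11\rangle$, matching (7.15).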
Exercise 7.2.1 Show that the following resource inequality holds:

$$[q \to qq] \geq [qq]. \tag{7.17}$$

That is, devise a protocol that generates a noiseless ebit with one use of a noiseless coherent
bit channel.

Exercise 7.2.2 Show that the following two resource inequalities cannot hold.

$$[q \to qq] \geq [q \to q], \tag{7.18}$$

$$[qq] \geq [q \to qq]. \tag{7.19}$$

We now have the following chain of resource inequalities:

$$[q \to q] \geq [q \to qq] \geq [qq]. \tag{7.20}$$

Thus, the power of the coherent bit channel lies in between that of a noiseless qubit channel
and a noiseless ebit.

Exercise 7.2.3 Another way to implement a noiseless coherent bit channel is with a variation
of teleportation that we name coherent communication assisted by entanglement and classical
communication. Suppose that Alice and Bob share an ebit $|\Phi^{+}\rangle^{AB}$. Alice can append an
ancilla qubit $|0\rangle^{A'}$ to this state and perform a local CNOT from $A$ to $A'$ to give the following
state:

$$|\Phi_{\text{GHZ}}\rangle^{AA'B} = \frac{1}{\sqrt{2}} \left( |000\rangle^{AA'B} + |111\rangle^{AA'B} \right). \tag{7.21}$$

Alice prepends an information qubit $|\psi\rangle^{A_1} \equiv \alpha|0\rangle^{A_1} + \beta|1\rangle^{A_1}$ to the above state so that the
global state is as follows:

$$|\psi\rangle^{A_1} |\Phi_{\text{GHZ}}\rangle^{AA'B}. \tag{7.22}$$

Suppose Alice performs the usual teleportation operations on systems $A_1$, $A$, and $A'$. Give the
steps that Alice and Bob should perform in order to generate the state $\alpha|0\rangle^{A}|0\rangle^{B} + \beta|1\rangle^{A}|1\rangle^{B}$,
thus implementing a noiseless coherent bit channel. Hint: The resource inequality for this
protocol is as follows:

$$[qq] + [c \to c] \geq [q \to qq]. \tag{7.23}$$

Exercise 7.2.4 Determine a qudit version of coherent communication assisted by classical 
communication and entanglement by modifying the steps in the above protocol. 



7.3 Coherent Dense Coding 



In the previous section, we introduced two protocols that implement a noiseless coherent bit
channel: the simple method based on a local CNOT gate and a noiseless qubit channel, and
coherent communication assisted by classical communication and entanglement
(Exercise 7.2.3). We now introduce a different
method for implementing two coherent bit channels that makes more judicious use of
available resources. We name it coherent super-dense coding because it is a coherent version of
the super-dense coding protocol.

The protocol proceeds as follows (Figure 7.3 depicts the protocol):

1. Alice and Bob share one ebit in the state $|\Phi^{+}\rangle^{AB}$ before the protocol begins.

2. Alice first prepares two qubits $A_1$ and $A_2$ in the state $|a_1\rangle^{A_1}|a_2\rangle^{A_2}$ and prepends this
state to the ebit. The global state is as follows:

$$|a_1\rangle^{A_1} |a_2\rangle^{A_2} |\Phi^{+}\rangle^{AB}, \tag{7.24}$$

where $a_1$ and $a_2$ are binary-valued. This preparation step is reminiscent of the
super-dense coding protocol (recall that, in the super-dense coding protocol, Alice has two
classical bits she would like to communicate).

3. Alice performs a CNOT gate from register $A_2$ to register $A$ and performs a controlled-Z
gate from register $A_1$ to register $A$. The resulting state is as follows:

$$|a_1\rangle^{A_1} |a_2\rangle^{A_2} \left( Z^{a_1} X^{a_2} \right)^{A} |\Phi^{+}\rangle^{AB}. \tag{7.25}$$

4. Alice transmits the qubit in register $A$ to Bob. We rename this register as $B_1$ and
Bob's other register $B$ as $B_2$.

5. Bob performs a CNOT gate from his register $B_1$ to $B_2$ and performs a Hadamard gate
on $B_1$. The final state is as follows:

$$|a_1\rangle^{A_1} |a_2\rangle^{A_2} |a_1\rangle^{B_1} |a_2\rangle^{B_2}. \tag{7.26}$$

Figure 7.3: The above figure depicts the protocol for coherent super-dense coding.

The above protocol implements two coherent bit channels: one from $A_1$ to $B_1$ and another
from $A_2$ to $B_2$. You can check that the protocol works for arbitrary superpositions of
two-qubit states on $A_1$ and $A_2$. It is for this reason that this protocol implements two coherent
bit channels. The resource inequality corresponding to coherent super-dense coding is

$$[qq] + [q \to q] \geq 2[q \to qq]. \tag{7.27}$$
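The claim that the protocol works for arbitrary superpositions on $A_1$ and $A_2$ can be checked with a small state-vector simulation. In this sketch the qubit ordering is $(A_1, A_2, A, B)$, and the helper `apply` and the function `coherent_dense_coding` are hypothetical names introduced for illustration.

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]])
CZ = np.diag([1, 1, 1, -1])

def apply(state, gate, targets):
    """Apply a k-qubit gate to the listed qubits of an n-qubit state vector."""
    n = state.size.bit_length() - 1
    k = len(targets)
    psi = state.reshape([2] * n)
    g = gate.reshape([2] * (2 * k))
    # Contract the gate's input indices with the target qubits.
    psi = np.tensordot(g, psi, axes=(list(range(k, 2 * k)), targets))
    # tensordot puts the gate's output indices first; move them back in place.
    psi = np.moveaxis(psi, list(range(k)), targets)
    return psi.reshape(-1)

def coherent_dense_coding(two_qubit_state):
    """Run steps 2 to 5 on an arbitrary state of A1 A2."""
    bell = np.array([1, 0, 0, 1]) / np.sqrt(2)
    state = np.kron(two_qubit_state, bell)      # order: A1, A2, A, B
    state = apply(state, CNOT, [1, 2])          # CNOT from A2 to A
    state = apply(state, CZ, [0, 2])            # controlled-Z from A1 to A
    # Sending A to Bob relabels A as B1 and B as B2.
    state = apply(state, CNOT, [2, 3])          # Bob: CNOT from B1 to B2
    state = apply(state, H, [2])                # Bob: Hadamard on B1
    return state

c = np.array([0.5, 0.5j, -0.5, 0.5])            # amplitudes of |a1 a2>
out = coherent_dense_coding(c)
expected = np.zeros(16, dtype=complex)
for a1 in range(2):
    for a2 in range(2):
        expected[8 * a1 + 4 * a2 + 2 * a1 + a2] = c[2 * a1 + a2]
print(np.allclose(out, expected))
```

The amplitude of each basis state $|a_1 a_2\rangle$ ends up on $|a_1 a_2 a_1 a_2\rangle$, which is the action of two coherent bit channels.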

Exercise 7.3.1 Construct a qudit version of coherent super-dense coding that implements
the following resource inequality:

$$\log d\,[qq] + \log d\,[q \to q] \geq 2 \log d\,[q \to qq]. \tag{7.28}$$

(Hint: The qudit analog of a controlled-NOT gate is

$$\sum_{i=0}^{d-1} |i\rangle\langle i| \otimes X(i), \tag{7.29}$$

where $X$ is defined in (3.209). The qudit analog of the controlled-Z gate is

$$\sum_{j=0}^{d-1} |j\rangle\langle j| \otimes Z(j), \tag{7.30}$$

where $Z$ is defined in (3.212). The qudit analog of the Hadamard gate is the Fourier transform
gate.)
Figure 7.4: The above figure depicts the protocol for coherent teleportation.



7.4 Coherent Teleportation 



We now introduce a coherent version of the teleportation protocol that we name coherent
teleportation. Let a $Z$ coherent bit channel $\Delta_Z$ be one that copies eigenstates of the $Z$
operator (this is as we defined a coherent bit channel before). Let an $X$ coherent bit channel
$\Delta_X$ be one that copies eigenstates of the $X$ operator:

$$\Delta_X : |+\rangle^{A} \to |+\rangle^{A} |+\rangle^{B}, \tag{7.31}$$

$$|-\rangle^{A} \to |-\rangle^{A} |-\rangle^{B}. \tag{7.32}$$


It does not really matter which basis we use to define a coherent bit channel — it just matters 
that it copies the orthogonal states of some basis. 

Exercise 7.4.1 Show how to simulate an X coherent bit channel with a Z coherent bit 
channel and local operations. 



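The protocol proceeds as follows (Figure 7.4 depicts the protocol):

1. Alice possesses an information qubit $|\psi\rangle^{A}$ where

$$|\psi\rangle^{A} \equiv \alpha|0\rangle^{A} + \beta|1\rangle^{A}.$$

She sends her qubit through a $Z$ coherent bit channel:

$$|\psi\rangle^{A} \xrightarrow{\Delta_Z} \alpha|0\rangle^{A}|0\rangle^{B_1} + \beta|1\rangle^{A}|1\rangle^{B_1} \tag{7.33}$$

$$\equiv |\tilde{\psi}\rangle^{AB_1}. \tag{7.34}$$

Let us rewrite the above state $|\tilde{\psi}\rangle^{AB_1}$ as follows:

$$|\tilde{\psi}\rangle^{AB_1} = \alpha \left( \frac{|+\rangle^{A} + |-\rangle^{A}}{\sqrt{2}} \right) |0\rangle^{B_1} + \beta \left( \frac{|+\rangle^{A} - |-\rangle^{A}}{\sqrt{2}} \right) |1\rangle^{B_1} \tag{7.35}$$

$$= \frac{1}{\sqrt{2}} \left[ |+\rangle^{A} \left( \alpha|0\rangle^{B_1} + \beta|1\rangle^{B_1} \right) + |-\rangle^{A} \left( \alpha|0\rangle^{B_1} - \beta|1\rangle^{B_1} \right) \right]. \tag{7.36}$$

2. Alice sends her qubit $A$ through an $X$ coherent bit channel with output systems $A$ and
$B_2$:

$$|\tilde{\psi}\rangle^{AB_1} \xrightarrow{\Delta_X} \frac{1}{\sqrt{2}} |+\rangle^{A}|+\rangle^{B_2} \left( \alpha|0\rangle^{B_1} + \beta|1\rangle^{B_1} \right) + \frac{1}{\sqrt{2}} |-\rangle^{A}|-\rangle^{B_2} \left( \alpha|0\rangle^{B_1} - \beta|1\rangle^{B_1} \right). \tag{7.37}$$

3. Bob then performs a CNOT gate from qubit $B_1$ to qubit $B_2$. Consider that the action
of the CNOT gate with the source qubit in the computational basis and the target
qubit in the $+/-$ basis is as follows:

$$|0\rangle|+\rangle \to |0\rangle|+\rangle, \tag{7.38}$$

$$|0\rangle|-\rangle \to |0\rangle|-\rangle, \tag{7.39}$$

$$|1\rangle|+\rangle \to |1\rangle|+\rangle, \tag{7.40}$$

$$|1\rangle|-\rangle \to -|1\rangle|-\rangle, \tag{7.41}$$

so that the last entry catches a phase of $\pi$ ($e^{i\pi} = -1$). Then this CNOT gate brings
the overall state to

$$\frac{1}{\sqrt{2}} \left[ |+\rangle^{A}|+\rangle^{B_2} \left( \alpha|0\rangle^{B_1} + \beta|1\rangle^{B_1} \right) + |-\rangle^{A}|-\rangle^{B_2} \left( \alpha|0\rangle^{B_1} + \beta|1\rangle^{B_1} \right) \right] \tag{7.42}$$

$$= \frac{1}{\sqrt{2}} \left[ |+\rangle^{A}|+\rangle^{B_2} + |-\rangle^{A}|-\rangle^{B_2} \right] \left( \alpha|0\rangle^{B_1} + \beta|1\rangle^{B_1} \right) \tag{7.43}$$

$$= |\Phi^{+}\rangle^{AB_2} |\psi\rangle^{B_1}. \tag{7.44}$$

Thus, Alice teleports her information qubit to Bob, and both Alice and Bob possess
one ebit at the end of the protocol.

The resource inequality for coherent teleportation is as follows:

$$2[q \to qq] \geq [qq] + [q \to q]. \tag{7.45}$$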
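The three steps of coherent teleportation can be traced numerically. In this sketch the two coherent bit channels are written directly as isometries stored as tensors with index order (first output, second output, input), and the final state is kept as a tensor with index order $(A, B_1, B_2)$; all names are hypothetical.

```python
import numpy as np

ket0, ket1 = np.eye(2)
plus, minus = (ket0 + ket1) / np.sqrt(2), (ket0 - ket1) / np.sqrt(2)

def copier(v, w):
    """Isometry |v> -> |v>|v>, |w> -> |w>|w> as a tensor [out1, out2, in]."""
    return (np.einsum('i,j,k->ijk', v, v, v)
            + np.einsum('i,j,k->ijk', w, w, w))

delta_z = copier(ket0, ket1)     # Z coherent bit channel
delta_x = copier(plus, minus)    # X coherent bit channel
# CNOT as a tensor [control out, target out, control in, target in].
cnot = np.array([[1, 0, 0, 0], [0, 1, 0, 0],
                 [0, 0, 0, 1], [0, 0, 1, 0]]).reshape(2, 2, 2, 2)

def coherent_teleportation(psi):
    """Return the final state as a tensor with index order (A, B1, B2)."""
    s1 = np.einsum('abi,i->ab', delta_z, psi)     # step 1: A -> A, B1
    s2 = np.einsum('acx,xb->abc', delta_x, s1)    # step 2: A -> A, B2
    s3 = np.einsum('pqbc,abc->apq', cnot, s2)     # step 3: CNOT from B1 to B2
    return s3

psi = np.array([0.6, 0.8j])
bell = np.array([1, 0, 0, 1]).reshape(2, 2) / np.sqrt(2)
expected = np.einsum('ac,b->abc', bell, psi)      # |Phi+> on (A, B2), |psi> on B1
print(np.allclose(coherent_teleportation(psi), expected))
```

The comparison confirms (7.44): the input qubit appears on $B_1$ while $A$ and $B_2$ hold an ebit.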



Exercise 7.4.2 Show how a cobit channel and an ebit can generate a GHZ state. That is,
demonstrate a protocol that implements the following resource inequality:

$$[qq]_{AB} + [q \to qq]_{BC} \geq [qqq]_{ABC}. \tag{7.46}$$

Exercise 7.4.3 Outline the qudit version of the above coherent teleportation protocol. The
protocol should implement the following resource inequality:

$$2 \log d\,[q \to qq] \geq \log d\,[qq] + \log d\,[q \to q]. \tag{7.47}$$






Exercise 7.4.4 Outline a catalytic version of the coherent teleportation protocol by modifying
the original teleportation protocol. Let Alice possess an information qubit $|\psi\rangle^{A'}$ and let
Alice and Bob share an ebit $|\Phi^{+}\rangle^{AB}$. Replace the Bell measurement with a controlled-NOT
and Hadamard gate, replace the classical bit channels with coherent bit channels, and
replace Bob's conditional unitary operations with controlled unitary operations. The resulting
resource inequality should be of the form:

$$2[q \to qq] + [qq] \geq [q \to q] + 2[qq]. \tag{7.48}$$

This protocol is catalytic in the sense that it gives the resource inequality in (7.45) when we
cancel one ebit from each side.

7.5 The Coherent Communication Identity 

The fundamental result of this chapter is the coherent communication identity: 

2[q->qq] = [qq] + [q->q]. (7.49) 

We obtain this identity by combining the resource inequality for coherent super-dense
coding in (7.27) and the resource inequality for coherent teleportation in (7.45). The coherent
communication identity demonstrates that coherent super-dense coding and coherent
teleportation are dual under resource reversal: the resources that coherent teleportation consumes
are the same as those that coherent super-dense coding generates and vice versa.

The major application of the coherent communication identity is in noisy quantum Shannon
theory. We will find later that its application is in the "upgrading" of protocols that
output private classical information. Suppose that a protocol outputs private classical bits.
The super-dense coding protocol is one such example, as the last paragraph of Section 6.2.3
argues. Then it is possible to upgrade the protocol by making it coherent, similar to the way
in which we made super-dense coding coherent by replacing conditional unitary operations
with controlled unitary operations.

We make this idea more precise with an example. The resource inequality for
entanglement-assisted classical coding (discussed in more detail in Chapter 20) has the following form:

$$\langle \mathcal{N} \rangle + E\,[qq] \geq C\,[c \to c], \tag{7.50}$$

where $\mathcal{N}$ is a noisy quantum channel that connects Alice to Bob, $E$ is some rate of
entanglement consumption, and $C$ is some rate of classical communication. It is possible to upgrade
the generated classical bits to coherent bits, for reasons that are similar to those that we used
in the upgrading of super-dense coding. The resulting resource inequality has the following
form:

$$\langle \mathcal{N} \rangle + E\,[qq] \geq C\,[q \to qq]. \tag{7.51}$$

We can now employ the coherent communication identity in (7.49) and argue that any
protocol that implements the above resource inequality can implement the following one:

$$\langle \mathcal{N} \rangle + E\,[qq] \geq \frac{C}{2}\,[q \to q] + \frac{C}{2}\,[qq], \tag{7.52}$$

merely by using the generated coherent bits in a coherent teleportation protocol, which by
(7.45) converts every two coherent bits into one qubit channel and one ebit. We
can then make a "catalytic argument" to cancel the ebits on both sides of the resource
inequality. The final resource inequality is as follows:

$$\langle \mathcal{N} \rangle + \left( E - \frac{C}{2} \right) [qq] \geq \frac{C}{2}\,[q \to q]. \tag{7.53}$$

The above resource inequality corresponds to a protocol for entanglement-assisted quantum
coding (also known as the father protocol), and it turns out to be optimal for some channels
as this protocol's converse theorem shows. This optimality is due to the efficient translation
of classical bits to coherent bits and the application of the coherent communication identity.
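The rate bookkeeping that leads from (7.50) to (7.53) can be written as a short calculation. The function name `upgrade_to_quantum` is hypothetical, and the sketch only performs the arithmetic of the resource inequalities; it does not model any particular channel.

```python
from fractions import Fraction

def upgrade_to_quantum(classical_rate, ebit_rate):
    """Given rates (C, E) from (7.50), return (Q, E_net) as in (7.53):
    Q = C/2 qubits generated and E_net = E - C/2 ebits consumed."""
    c, e = Fraction(classical_rate), Fraction(ebit_rate)
    # Upgrade C classical bits to C coherent bits, as in (7.51), then
    # convert every two coherent bits into one qubit and one ebit (7.49).
    qubits = c / 2
    ebits_returned = c / 2
    # Catalytic cancellation of the returned ebits against the E consumed.
    return qubits, e - ebits_returned

# Sanity check with a noiseless qubit channel used for super-dense coding:
# C = 2 classical bits per use with E = 1 ebit per use.
print(upgrade_to_quantum(2, 1))
```

For the noiseless qubit channel this returns one qubit per use and no net entanglement consumption, as one would expect.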

7.6 History and Further Reading 

Harrow introduced the idea of coherent communication in Ref. [123]. Later, some researchers
studied the continuous-variable coherent channel as well [254]. The idea of coherent
communication has many applications in quantum Shannon theory which we will study in later
chapters.

CHAPTER 8



The Unit Resource Capacity Region 



In Chapter 6, we presented the three unit protocols of teleportation, super-dense coding, and
entanglement distribution. The physical arguments in Section 6.3 prove that each of these
protocols is individually optimal. For example, recall that the entanglement distribution
protocol is optimal because two parties cannot generate more than one ebit from the use of
one noiseless qubit channel.

In this chapter, we show that these three protocols are actually the most important
protocols: we do not need to consider any other protocols when the noiseless resources of
classical communication, quantum communication, and entanglement are available.
Combining these three protocols together is the best that one can do with the unit resources.

In this sense, this chapter gives the first example in this book of a converse proof of
a capacity theorem. We construct a three-dimensional region, known as the unit resource
achievable region, that the three unit protocols fill out. The converse proof of this chapter
employs good physical arguments to show that the unit resource achievable region is optimal,
and we can then refer to it as the unit resource capacity region. We later exploit the
development here when we get to the study of trade-off capacities (see Chapter 24).

8.1 The Unit Resource Achievable Region 

Let us first recall the resource inequalities for the three unit protocols. The resource
inequality for teleportation is

$$2[c \to c] + [qq] \geq [q \to q], \tag{8.1}$$

while that for super-dense coding is

$$[q \to q] + [qq] \geq 2[c \to c], \tag{8.2}$$

and that for entanglement distribution is as follows:

$$[q \to q] \geq [qq]. \tag{8.3}$$

Each of the resources $[q \to q]$, $[qq]$, and $[c \to c]$ is a unit resource.

The above three unit protocols are sufficient to recover all other unit protocols. For
example, we can combine super-dense coding and entanglement distribution to produce the
following resource inequality:

$$2[q \to q] + [qq] \geq 2[c \to c] + [qq]. \tag{8.4}$$

The above resource inequality is equivalent to the following one:

$$[q \to q] \geq [c \to c], \tag{8.5}$$

after removing the entanglement from both sides and scaling by 1/2 (we can remove the
entanglement here because it acts as a catalytic resource). We can justify this by considering
a scenario in which we use the above protocol N times. For the first run of the protocol, we
require one ebit to get it started, but then every other run both consumes and generates one
ebit, giving

$$2N[q \to q] + [qq] \geq 2N[c \to c] + [qq]. \tag{8.6}$$

Dividing by N gives the rate of the task, and as N becomes large, the use of the initial ebit
is negligible. We refer to (8.5) as "classical coding over a noiseless qubit channel."
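
For concreteness, dividing both sides of (8.6) by N can be written out as follows (a worked step added here for illustration only):

```latex
2[q \to q] + \frac{1}{N}[qq] \geq 2[c \to c] + \frac{1}{N}[qq].
```

The catalytic ebit enters at rate 1/N on each side. This rate vanishes as N grows, and scaling what remains by 1/2 recovers (8.5).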

We can think of the above resource inequalities in a different way. Let us consider a 
three-dimensional space with points of the form (C,Q,E), where C corresponds to noise- 
less classical communication, Q corresponds to noiseless quantum communication, and E 
corresponds to noiseless entanglement. Each point in this space corresponds to a protocol 
involving the unit resources. A coordinate of a point is negative if the point's corresponding 
resource inequality consumes that coordinate's corresponding resource, and a coordinate of a 
point is positive if the point's corresponding resource inequality generates that coordinate's 
corresponding resource. 

For example, the point corresponding to the teleportation protocol is

$$x_{TP} = (-2, 1, -1), \tag{8.7}$$

because teleportation consumes two noiseless classical bit channels and one ebit to generate
one noiseless qubit channel. For similar reasons, the respective points corresponding to
super-dense coding and entanglement distribution are as follows:

$$x_{SD} = (2, -1, -1), \tag{8.8}$$

$$x_{ED} = (0, -1, 1). \tag{8.9}$$

Figure 8.1 plots these three points in the three-dimensional space of classical communication,
quantum communication, and entanglement.

Figure 8.1: The three points corresponding to the three respective unit protocols of entanglement
distribution (ED), teleportation (TP), and super-dense coding (SD).

We can execute any of the three unit protocols just one time, or we can execute any one
of them m times, where m is some positive integer. Executing a protocol m times then gives
other points in the three-dimensional space. That is, we can also achieve the points $m x_{TP}$,
$m x_{SD}$, and $m x_{ED}$ for any positive m. This method allows us to fill up a certain portion of
the three-dimensional space. Let us also suppose that we can achieve real number amounts
of each protocol. This becomes important later on when we consider combining the three
unit protocols in order to achieve certain rates of transmission (a communication rate can
be any real number). Thus, we can combine the protocols together to achieve any point of
the following form:

$$\alpha x_{TP} + \beta x_{SD} + \gamma x_{ED}, \tag{8.10}$$

where $\alpha, \beta, \gamma \geq 0$.
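
The combination in (8.10) is ordinary vector arithmetic on the points (8.7)-(8.9). The following Python sketch makes this concrete. The function name `combine` and the use of exact fractions are illustrative choices, not notation from the text.

```python
from fractions import Fraction

# Points (C, Q, E) of the three unit protocols, from (8.7)-(8.9).
X_TP = (-2, 1, -1)   # teleportation
X_SD = (2, -1, -1)   # super-dense coding
X_ED = (0, -1, 1)    # entanglement distribution


def combine(alpha, beta, gamma):
    """Return alpha*x_TP + beta*x_SD + gamma*x_ED as in (8.10)."""
    if min(alpha, beta, gamma) < 0:
        raise ValueError("coefficients must be non-negative")
    return tuple(
        Fraction(alpha) * tp + Fraction(beta) * sd + Fraction(gamma) * ed
        for tp, sd, ed in zip(X_TP, X_SD, X_ED)
    )


# One run each of super-dense coding and entanglement distribution:
# two qubit channels are consumed, two classical bit channels are generated,
# and the entanglement cancels, which is the content of (8.4).
print(combine(0, 1, 1))
```

The resulting triple has C = 2, Q = -2, and E = 0, matching (8.4) after the catalytic ebit is removed.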

Let us further establish some notation. Let L denote a line, Q a quadrant, and O an
octant in the three-dimensional space (it should be clear from context whether Q refers to
quantum communication or "quadrant"). For example, $L^{-00}$ denotes a line going in the
direction of negative classical communication:

$$L^{-00} = \{\alpha(-1, 0, 0) : \alpha \geq 0\}. \tag{8.11}$$

$Q^{0+-}$ denotes the quadrant where there is zero classical communication, generation of
quantum communication, and consumption of entanglement:

$$Q^{0+-} = \{\alpha(0, 1, 0) + \beta(0, 0, -1) : \alpha, \beta \geq 0\}. \tag{8.12}$$

$O^{+-+}$ denotes the octant where there is generation of classical communication, consumption
of quantum communication, and generation of entanglement:

$$O^{+-+} = \{\alpha(1, 0, 0) + \beta(0, -1, 0) + \gamma(0, 0, 1) : \alpha, \beta, \gamma \geq 0\}. \tag{8.13}$$

It proves useful to have a "set addition" operation between two regions A and B (known
as the Minkowski sum):

$$A + B = \{a + b : a \in A, b \in B\}. \tag{8.14}$$

The following relations hold

$$Q^{0+-} = L^{0+0} + L^{00-}, \tag{8.15}$$

$$O^{+-+} = L^{+00} + L^{0-0} + L^{00+}, \tag{8.16}$$

by using the above definition.

The following geometric objects lie in the (C, Q, E) space:

1. The "line of teleportation" $L_{TP}$ is the following set of points:

$$L_{TP} = \{\alpha(-2, 1, -1) : \alpha \geq 0\}. \tag{8.17}$$

2. The "line of super-dense coding" $L_{SD}$ is the following set of points:

$$L_{SD} = \{\beta(2, -1, -1) : \beta \geq 0\}. \tag{8.18}$$

3. The "line of entanglement distribution" $L_{ED}$ is the following set of points:

$$L_{ED} = \{\gamma(0, -1, 1) : \gamma \geq 0\}. \tag{8.19}$$

Definition 8.1.1. Let $\widetilde{C}_U$ denote the unit resource achievable region. It consists of all
non-negative linear combinations of the above protocols:

$$\widetilde{C}_U = L_{TP} + L_{SD} + L_{ED}. \tag{8.20}$$

The following matrix equation gives all achievable triples (C, Q, E) in $\widetilde{C}_U$:

$$\begin{bmatrix} C \\ Q \\ E \end{bmatrix} = \begin{bmatrix} -2 & 2 & 0 \\ 1 & -1 & -1 \\ -1 & -1 & 1 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \\ \gamma \end{bmatrix}, \tag{8.21}$$

where $\alpha, \beta, \gamma \geq 0$. We can rewrite the above equation with its matrix inverse:

$$\begin{bmatrix} \alpha \\ \beta \\ \gamma \end{bmatrix} = \begin{bmatrix} -1/2 & -1/2 & -1/2 \\ 0 & -1/2 & -1/2 \\ -1/2 & -1 & 0 \end{bmatrix} \begin{bmatrix} C \\ Q \\ E \end{bmatrix}, \tag{8.22}$$

in order to express the coefficients $\alpha$, $\beta$, and $\gamma$ as a function of the rate triples (C, Q, E). The
restriction of non-negativity of $\alpha$, $\beta$, and $\gamma$ gives the following restriction on the achievable
rate triples (C, Q, E):

$$C + Q + E \leq 0, \tag{8.23}$$

$$Q + E \leq 0, \tag{8.24}$$

$$C + 2Q \leq 0. \tag{8.25}$$

The above result implies that the achievable region $\widetilde{C}_U$ in (8.20) is equivalent to all rate
triples satisfying (8.23)-(8.25). Figure 8.2 displays the full unit resource achievable region.

Figure 8.2: The unit resource achievable region $\widetilde{C}_U$.
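
The passage from (8.21) to the inequalities (8.23)-(8.25) can be checked mechanically. The sketch below is only an illustration: the helper names `matvec`, `coefficients`, and `is_achievable` are assumptions introduced here, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Columns are x_TP, x_SD, x_ED, as in (8.21).
M = [[-2, 2, 0],
     [1, -1, -1],
     [-1, -1, 1]]

# Inverse matrix from (8.22).
HALF = Fraction(1, 2)
M_INV = [[-HALF, -HALF, -HALF],
         [0, -HALF, -HALF],
         [-HALF, -1, 0]]


def matvec(matrix, vector):
    """Multiply a 3x3 matrix by a length-3 vector."""
    return tuple(sum(a * v for a, v in zip(row, vector)) for row in matrix)


def coefficients(C, Q, E):
    """Return (alpha, beta, gamma) for the rate triple (C, Q, E)."""
    return matvec(M_INV, (C, Q, E))


def is_achievable(C, Q, E):
    """Check the inequalities (8.23)-(8.25)."""
    return C + Q + E <= 0 and Q + E <= 0 and C + 2 * Q <= 0


# Sanity check: applying M to each column of M_INV gives the identity.
identity_columns = [matvec(M, column) for column in zip(*M_INV)]
assert identity_columns == [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

print(coefficients(2, -2, 0))    # super-dense coding plus entanglement distribution
print(is_achievable(1, -1, 0))   # classical coding over a noiseless qubit channel
print(is_achievable(1, 0, -1))   # one classical bit from one ebit alone
```

The last call returns False because the triple violates (8.25), in line with the postulate that entanglement alone cannot generate classical communication.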



Definition 8.1.2. The unit resource capacity region $C_U$ is the closure of the set of all points
(C, Q, E) in the C, Q, E space, satisfying the following resource inequality:

$$0 \geq C[c \to c] + Q[q \to q] + E[qq]. \tag{8.26}$$

The definition states that the unit resource capacity region consists of all those points
(C, Q, E) that have corresponding protocols that can implement them. The notation in the
above definition may seem slightly confusing at first glance until we recall that a resource
with a negative rate implicitly belongs on the left-hand side of the resource inequality.

Theorem 8.1.1 below gives the optimal three-dimensional capacity region for the three
unit resources.

Theorem 8.1.1. The unit resource capacity region $C_U$ is equivalent to the unit resource
achievable region $\widetilde{C}_U$:

$$C_U = \widetilde{C}_U. \tag{8.27}$$

Proving the above theorem involves two steps: the direct coding theorem and the converse
theorem. For this case, the direct coding theorem establishes that the achievable region $\widetilde{C}_U$
is in the capacity region $C_U$:

$$\widetilde{C}_U \subseteq C_U. \tag{8.28}$$

The converse theorem, on the other hand, establishes that the achievable region $\widetilde{C}_U$ is
optimal:

$$C_U \subseteq \widetilde{C}_U. \tag{8.29}$$

8.2 The Direct Coding Theorem

The result of the direct coding theorem, that $\widetilde{C}_U \subseteq C_U$, is immediate from the definition in
(8.20) of the unit resource achievable region $\widetilde{C}_U$, the definition in (8.26) of the unit resource
capacity region $C_U$, and the theory of resource inequalities. We can achieve points in the
unit resource capacity region simply by considering positive linear combinations of the three
unit protocols. The next section shows that the unit resource capacity region consists of all
and only those points in the unit resource achievable region.

8.3 The Converse Theorem 



We employ the definition of $\widetilde{C}_U$ in (8.20) and consider the eight octants of the (C, Q, E)
space individually in order to prove the converse theorem (that $C_U \subseteq \widetilde{C}_U$). Let (±, ±, ±)
denote labels for the eight different octants.

It is possible to demonstrate the optimality of each of these three protocols individually
with a contradiction argument, as we saw in Chapter 6. However, in the converse proof
of Theorem 8.1.1, we show that a mixed strategy combining these three unit protocols is
optimal.

We accept the following two postulates and exploit them in order to prove the converse: 



1. Entanglement alone cannot generate classical communication or quantum communica- 
tion or both. 

2. Classical communication alone cannot generate entanglement or quantum communica- 
tion or both. 



(+, +, +). This octant of $C_U$ is empty because a sender and receiver require some
resources to implement classical communication, quantum communication, and entanglement.
(They cannot generate a unit resource from nothing!)

(+, +, -). This octant of $C_U$ is empty because entanglement alone cannot generate
either classical communication or quantum communication or both.

(+, -, +). The task for this octant is to generate a noiseless classical channel of C bits
and E ebits of entanglement by consuming $|Q|$ qubits of quantum communication. We thus
consider all points of the form (C, Q, E) where $C \geq 0$, $Q \leq 0$, and $E \geq 0$. It suffices to prove
the following inequality:

$$C + E \leq |Q|, \tag{8.30}$$

because combining (8.30) with $C \geq 0$ and $E \geq 0$ implies (8.23)-(8.25). The achievability of
$(C, -|Q|, E)$ implies the achievability of the point $(C + 2E, -|Q| - E, 0)$, because we can
consume all of the entanglement with super-dense coding (8.2):

$$(C + 2E, -|Q| - E, 0) = (C, -|Q|, E) + (2E, -E, -E). \tag{8.31}$$

This new point implies that there is a protocol that consumes $|Q| + E$ noiseless qubit channels
to send $C + 2E$ classical bits. The following bound then applies

$$C + 2E \leq |Q| + E, \tag{8.32}$$

because the Holevo bound (Exercise 4.2.2 gives a simpler statement of this bound) states
that we can send only one classical bit per qubit. The bound in (8.30) then follows.



(+, -, -). The task for this octant is to simulate a classical channel of size C bits using
$|Q|$ qubits of quantum communication and $|E|$ ebits of entanglement. We consider all points
of the form (C, Q, E) where $C \geq 0$, $Q \leq 0$, and $E \leq 0$. It suffices to prove the following
inequalities:

$$C \leq 2|Q|, \tag{8.33}$$

$$C \leq |Q| + |E|, \tag{8.34}$$

because combining (8.33)-(8.34) with $C \geq 0$ implies (8.23)-(8.25). The achievability of $(C, -|Q|, -|E|)$
implies the achievability of $(0, -|Q| + C/2, -|E| - C/2)$, because we can consume all of the
classical communication with teleportation (8.1):

$$(0, -|Q| + C/2, -|E| - C/2) = (C, -|Q|, -|E|) + (-C, C/2, -C/2). \tag{8.35}$$

The following bound applies (quantum communication cannot be positive)

$$-|Q| + C/2 \leq 0, \tag{8.36}$$

because entanglement alone cannot generate quantum communication. The bound in (8.33)
then follows from the above bound. The achievability of $(C, -|Q|, -|E|)$ implies the
achievability of $(C, -|Q| - |E|, 0)$ because we can consume an extra $|E|$ qubit channels with
entanglement distribution (8.3):

$$(C, -|Q| - |E|, 0) = (C, -|Q|, -|E|) + (0, -|E|, |E|). \tag{8.37}$$

The bound in (8.34) then applies by the same Holevo bound argument as in the previous
octant.

(-, +, +). This octant of $C_U$ is empty because classical communication alone cannot
generate either quantum communication or entanglement or both.

(-, +, -). The task for this octant is to simulate a quantum channel of size Q qubits
using $|E|$ ebits of entanglement and $|C|$ bits of classical communication. We consider all
points of the form (C, Q, E) where $C \leq 0$, $Q \geq 0$, and $E \leq 0$. It suffices to prove the
following inequalities:

$$Q \leq |E|, \tag{8.38}$$

$$2Q \leq |C|, \tag{8.39}$$

because combining them with $C \leq 0$ implies (8.23)-(8.25). The achievability of the point
$(-|C|, Q, -|E|)$ implies the achievability of the point $(-|C|, 0, Q - |E|)$, because we can
consume all of the quantum communication for entanglement distribution (8.3):

$$(-|C|, 0, Q - |E|) = (-|C|, Q, -|E|) + (0, -Q, Q). \tag{8.40}$$

The following bound applies (entanglement cannot be positive)

$$Q - |E| \leq 0, \tag{8.41}$$

because classical communication alone cannot generate entanglement. The bound in (8.38)
follows from the above bound. The achievability of the point $(-|C|, Q, -|E|)$ implies the
achievability of the point $(-|C| + 2Q, 0, -Q - |E|)$, because we can consume all of the
quantum communication for super-dense coding (8.2):

$$(-|C| + 2Q, 0, -Q - |E|) = (-|C|, Q, -|E|) + (2Q, -Q, -Q). \tag{8.42}$$

The following bound applies (classical communication cannot be positive)

$$-|C| + 2Q \leq 0, \tag{8.43}$$

because entanglement alone cannot create classical communication. The bound in (8.39)
follows from the above bound.

(-, -, +). The task for this octant is to create E ebits of entanglement using $|Q|$ qubits
of quantum communication and $|C|$ bits of classical communication. We consider all points
of the form (C, Q, E) where $C \leq 0$, $Q \leq 0$, and $E \geq 0$. It suffices to prove the following
inequality:

$$E \leq |Q|, \tag{8.44}$$

because combining it with $Q \leq 0$ and $C \leq 0$ implies (8.23)-(8.25). The achievability of
$(-|C|, -|Q|, E)$ implies the achievability of $(-|C| - 2E, -|Q| + E, 0)$, because we can consume
all of the entanglement with teleportation (8.1):

$$(-|C| - 2E, -|Q| + E, 0) = (-|C|, -|Q|, E) + (-2E, E, -E). \tag{8.45}$$

The following bound applies (quantum communication cannot be positive)

$$-|Q| + E \leq 0, \tag{8.46}$$

because classical communication alone cannot generate quantum communication. The bound
in (8.44) follows from the above bound.

(-, -, -). $\widetilde{C}_U$ completely contains this octant.
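
Each octant argument above ends with the claim that the octant-specific bounds, together with the sign constraints, imply (8.23)-(8.25). The following sketch checks that claim on randomly sampled points. It is a numerical sanity check rather than a proof, and the names `in_region`, `OCTANTS`, and `check_octants` are assumptions made for this illustration.

```python
import random


def in_region(C, Q, E, tol=1e-12):
    """Inequalities (8.23)-(8.25), with a small tolerance for rounding."""
    return C + Q + E <= tol and Q + E <= tol and C + 2 * Q <= tol


# Sign pattern of (C, Q, E) and the bounds proved for each nontrivial octant.
OCTANTS = {
    "(+,-,+)": ((1, -1, 1),
                lambda C, Q, E: C + E <= abs(Q)),                            # (8.30)
    "(+,-,-)": ((1, -1, -1),
                lambda C, Q, E: C <= 2 * abs(Q) and C <= abs(Q) + abs(E)),   # (8.33)-(8.34)
    "(-,+,-)": ((-1, 1, -1),
                lambda C, Q, E: Q <= abs(E) and 2 * Q <= abs(C)),            # (8.38)-(8.39)
    "(-,-,+)": ((-1, -1, 1),
                lambda C, Q, E: E <= abs(Q)),                                # (8.44)
}


def check_octants(samples=10000, seed=0):
    """Return False if a sampled point meets the octant bounds but violates (8.23)-(8.25)."""
    rng = random.Random(seed)
    for signs, bounds in OCTANTS.values():
        for _ in range(samples):
            C, Q, E = (s * rng.uniform(0, 10) for s in signs)
            if bounds(C, Q, E) and not in_region(C, Q, E):
                return False
    return True


print(check_octants())
```

The sampling range and the tolerance are arbitrary; the check only confirms that no counterexample shows up among the sampled points.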



We have now proved that the set of inequalities in (8.23)-(8.25) holds for all octants of
the (C, Q, E) space. The next exercises ask you to consider similar unit resource achievable
regions.

Exercise 8.3.1 Consider the resources of public classical communication:

$$[c \to c]_{pub}, \tag{8.47}$$

private classical communication:

$$[c \to c]_{priv}, \tag{8.48}$$

and shared secret key:

$$[cc]_{priv}. \tag{8.49}$$

Public classical communication is equivalent to the following channel:

$$\rho \to \sum_i \langle i|\rho|i\rangle \, |i\rangle\langle i|^B \otimes \sigma_i^E, \tag{8.50}$$

so that an eavesdropper Eve obtains some correlations with the transmitted state $\rho$. Private
classical communication is equivalent to the following channel:

$$\rho \to \sum_i \langle i|\rho|i\rangle \, |i\rangle\langle i|^B \otimes \sigma^E, \tag{8.51}$$

so that Eve's state is independent of the information that Bob receives. Finally, a secret key
is a state of the following form:

$$\overline{\Phi}^{AB} \otimes \sigma^E = \left( \frac{1}{d} \sum_i |i\rangle\langle i|^A \otimes |i\rangle\langle i|^B \right) \otimes \sigma^E, \tag{8.52}$$

so that Alice and Bob share maximal classical correlation and Eve's state is independent of it.
There are three protocols that relate these three classical resources. Secret key distribution
is a protocol that consumes a noiseless private channel to generate a noiseless secret key. It
has the following resource inequality:

$$[c \to c]_{priv} \geq [cc]_{priv}. \tag{8.53}$$

The one-time pad protocol exploits a shared secret key and a noiseless public channel to
generate a noiseless private channel (it simply XORs a bit of secret key with the bit that
the sender wants to transmit and this protocol is provably unbreakable if the secret key is
perfectly secret). It has the following resource inequality:

$$[c \to c]_{pub} + [cc]_{priv} \geq [c \to c]_{priv}. \tag{8.54}$$
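
The XOR step of the one-time pad can be sketched in a few lines of Python. The function name `one_time_pad` and the sample message are illustrative assumptions. The sketch shows only the encoding and decoding arithmetic and does not model Eve.

```python
import secrets


def one_time_pad(bits, key):
    """XOR each message bit with the corresponding secret-key bit."""
    if len(bits) != len(key):
        raise ValueError("the key must be as long as the message")
    return [b ^ k for b, k in zip(bits, key)]


message = [1, 0, 1, 1, 0]
key = [secrets.randbits(1) for _ in message]   # shared secret key, one bit per message bit
ciphertext = one_time_pad(message, key)        # sent over the public channel
recovered = one_time_pad(ciphertext, key)      # the receiver XORs with the same key
print(recovered == message)
```

Each message bit consumes one bit of secret key and one use of the public channel, which mirrors the one-to-one accounting in (8.54).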

Finally, private classical communication can simulate public classical communication if we 
assume that Bob has a local register where he can place information and he then gives this 
to Eve. It has the following resource inequality: 

$$[c \to c]_{priv} \geq [c \to c]_{pub}. \tag{8.55}$$

Show that these three protocols fill out an optimal achievable region in the space of public
classical communication, private classical communication, and secret key. Use the following
two postulates to prove optimality: (1) public classical communication alone cannot generate
secret key or private classical communication, (2) secret key alone cannot generate public
or private classical communication.

Exercise 8.3.2 Consider the resource of coherent communication from Chapter 7:

$$[q \to qq]. \tag{8.56}$$

Recall the coherent communication identity in (7.49):

$$2[q \to qq] = [q \to q] + [qq]. \tag{8.57}$$

Recall the other resource inequalities for coherent communication:

$$[q \to q] \geq [q \to qq] \geq [qq]. \tag{8.58}$$

Consider a space of points (C, Q, E) where C corresponds to coherent communication, Q
to quantum communication, and E to entanglement. Determine the achievable region one
obtains with the above resource inequalities and another trivial resource inequality:

$$[qq] \geq 0. \tag{8.59}$$

We interpret the above resource inequality as "entanglement consumption," where Alice
simply throws away entanglement.

8.4 History and Further Reading 

The unit resource capacity region first appeared in Ref. [160] in the context of trade-off
coding. The private unit resource capacity region later appeared in Ref. [252].

CHAPTER 9



Distance Measures 



We discussed the major noiseless quantum communication protocols such as teleportation,
super-dense coding, their coherent versions, and entanglement distribution in detail in
Chapters 6, 7, and 8. Each of these protocols relies on the assumption that noiseless resources
are available. For example, the entanglement distribution protocol assumes that a noiseless
qubit channel is available to generate a noiseless ebit. This idealization allowed us to develop
the main principles of the protocols without having to think about more complicated issues,
but in practice, the protocols do not work as expected in the presence of noise.

Given that quantum systems suffer noise in practice, we would like to have a way to
determine how well a protocol is performing. The simplest way to do so is to compare the
output of an ideal protocol to the output of the actual protocol using a distance measure
of the two respective output quantum states. That is, suppose that a quantum information
processing protocol should ideally output some quantum state $|\psi\rangle$, but the actual output
of the protocol is a quantum state with density operator $\rho$. Then a performance measure
$P(|\psi\rangle, \rho)$ should indicate how close the ideal output is to the actual output. Figure 9.1
depicts the comparison of an ideal protocol with another protocol that is noisy.

This chapter introduces two distance measures that allow us to determine how close two
quantum states are to each other. The first distance measure that we discuss is the trace
distance and the second is the fidelity. (Note, though, that the fidelity is not a distance
measure in the strict mathematical sense. Nevertheless, we exploit it as a "closeness" measure
of quantum states because it admits an intuitive operational interpretation.) These two
measures are mostly interchangeable, but we introduce both because it is often more
convenient in a given situation to use one or the other.

Figure 9.1: A distance measure quantifies how far the output of a given ideal protocol (depicted on the
left) is from an actual protocol that exploits a noisy resource (depicted as the noisy quantum channel $\mathcal{N}$
on the right).

Distance measures are particularly important in quantum Shannon theory because they 
provide a way for us to determine how well a protocol is performing. Recall that Shannon's 
method (outlined in Chapter 2) for both the noiseless and noisy coding theorem is to allow
for a slight error in a protocol, but to show that this error vanishes in the limit of large 
block length. In later chapters where we prove quantum coding theorems, we borrow this 
technique of demonstrating asymptotically small error, with either the trace distance or the 
fidelity as the measure of performance. 



9.1 Trace Distance 



We first introduce the trace distance. Our presentation is somewhat mathematical because 
we exploit norms on linear operators in order to define it. Despite this mathematical flavor, 
we end this section with an intuitive operational interpretation of the trace distance. 

9.1.1 Trace Norm 

We begin by defining the trace norm or $\ell_1$-norm $\|M\|_1$ of an Hermitian operator$^1$ $M$:

$$\|M\|_1 = \mathrm{Tr}\{\sqrt{M^\dagger M}\}. \tag{9.1}$$

Recall that any function $f$ applied to an Hermitian operator $A$ is as follows:

$$f(A) = \sum_i f(\alpha_i) |i\rangle\langle i|, \tag{9.2}$$

where $\sum_i \alpha_i |i\rangle\langle i|$ is the spectral decomposition of $A$. With these two definitions, it is
straightforward to show that the trace norm of $M$ is the absolute sum of its eigenvalues:

$$\|M\|_1 = \sum_i |\mu_i|, \tag{9.3}$$

where the spectral decomposition of $M$ is $\sum_i \mu_i |i\rangle\langle i|$.
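
As a numerical illustration of (9.1) and (9.3), the following sketch computes the trace norm of a Hermitian matrix in both ways using NumPy. The function names are assumptions introduced for this example.

```python
import numpy as np


def trace_norm(M):
    """Trace norm of a Hermitian matrix: the sum of the absolute eigenvalues, as in (9.3)."""
    return float(np.sum(np.abs(np.linalg.eigvalsh(M))))


def trace_norm_from_definition(M):
    """Trace norm via (9.1): Tr{sqrt(M^dagger M)}, with the square root taken as in (9.2)."""
    eigenvalues = np.linalg.eigvalsh(M.conj().T @ M)
    # Clip tiny negative values caused by rounding before taking the square root.
    return float(np.sum(np.sqrt(np.clip(eigenvalues, 0.0, None))))


pauli_z = np.array([[1.0, 0.0], [0.0, -1.0]])
print(trace_norm(pauli_z), trace_norm_from_definition(pauli_z))
```

Both functions give 2 for the Pauli Z matrix, whose eigenvalues are +1 and -1.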

The trace norm is indeed a norm because it satisfies the following three properties:
positive definiteness, homogeneity, and the triangle inequality.

Property 9.1.1 (Positive Definiteness) The trace norm of an Hermitian operator $M$ is
always positive definite:

$$\|M\|_1 \geq 0. \tag{9.4}$$

The trace norm is null if and only if the operator $M$ is null:

$$\|M\|_1 = 0 \iff M = 0. \tag{9.5}$$

$^1$ The trace norm applies to any operator, but we restrict ourselves to Hermitian operators to simplify the
discussion.

Property 9.1.2 (Homogeneity) For any constant $c \in \mathbb{C}$,

$$\|cM\|_1 = |c| \, \|M\|_1. \tag{9.6}$$

Property 9.1.3 (Triangle Inequality) For any two operators $M$ and $N$, the following
triangle inequality holds

$$\|M + N\|_1 \leq \|M\|_1 + \|N\|_1. \tag{9.7}$$

Positive definiteness follows because the absolute sum of the eigenvalues of an operator
is always non-negative, and the eigenvalues are null (and thus the operator is null) if and
only if the absolute sum of the eigenvalues is null. Homogeneity follows because the absolute
eigenvalues of $cM$ are equal to $|c|$ times those of $M$. We later give a proof of the triangle
inequality (though, for a special case only).

Two other important properties of the trace norm are its invariance under isometries
and convexity. Each of the properties below often arises as a useful tool in quantum Shannon
theory.

Property 9.1.4 (Isometric invariance) The trace norm is invariant under conjugation
by an isometry $U$:

$$\|U M U^\dagger\|_1 = \|M\|_1. \tag{9.8}$$

Property 9.1.5 (Convexity) For any two operators $M$ and $N$ and any convex coefficients
$\lambda_1, \lambda_2 \geq 0$ such that $\lambda_1 + \lambda_2 = 1$, the following convexity inequality holds

$$\|\lambda_1 M + \lambda_2 N\|_1 \leq \lambda_1 \|M\|_1 + \lambda_2 \|N\|_1. \tag{9.9}$$

Isometric invariance holds because $M$ and $U M U^\dagger$ have the same nonzero eigenvalues. Convexity
follows directly from the triangle inequality and homogeneity.
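
The one-line argument behind convexity can be written out as follows (a worked step added for illustration, using only Properties 9.1.2 and 9.1.3):

```latex
\|\lambda_1 M + \lambda_2 N\|_1
  \leq \|\lambda_1 M\|_1 + \|\lambda_2 N\|_1
  = \lambda_1 \|M\|_1 + \lambda_2 \|N\|_1 .
```

The inequality is the triangle inequality, and the equality uses homogeneity together with the non-negativity of the two coefficients.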

9.1.2 Trace Distance from the Trace Norm 

The trace norm induces a natural distance measure between operators, called the trace
distance.

Definition 9.1.1 (Trace Distance). Given any two operators $M$ and $N$, the trace distance
between them is as follows:

$$\|M - N\|_1 = \mathrm{Tr}\{\sqrt{(M - N)^\dagger (M - N)}\}. \tag{9.10}$$

The trace distance is especially useful as a measure of the distinguishability of two
quantum states with respective density operators $\rho$ and $\sigma$. The following bounds apply to the
trace distance between any two density operators $\rho$ and $\sigma$:

$$0 \leq \|\rho - \sigma\|_1 \leq 2. \tag{9.11}$$

The lower bound applies when two quantum states are equivalent: quantum states $\rho$ and
$\sigma$ are equivalent to each other if and only if their trace distance is zero. The physical implication
of null trace distance is that no measurement can distinguish $\rho$ from $\sigma$. The upper bound
follows from the triangle inequality:

$$\|\rho - \sigma\|_1 \leq \|\rho\|_1 + \|\sigma\|_1 = 2. \tag{9.12}$$

The trace distance is maximum when $\rho$ and $\sigma$ have support on orthogonal subspaces. The
physical implication of maximal trace distance is that there exists a measurement that can
perfectly distinguish $\rho$ from $\sigma$. We discuss these operational interpretations of the trace
distance in more detail in Section 9.1.4.
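
The two extreme cases of the bounds on the trace distance are easy to reproduce numerically. The sketch below uses NumPy and three qubit states chosen only for illustration. The helper name `trace_distance` is an assumption of this example.

```python
import numpy as np


def trace_distance(rho, sigma):
    """Trace distance ||rho - sigma||_1 between two density matrices."""
    return float(np.sum(np.abs(np.linalg.eigvalsh(rho - sigma))))


proj0 = np.array([[1.0, 0.0], [0.0, 0.0]])   # |0><0|
proj1 = np.array([[0.0, 0.0], [0.0, 1.0]])   # |1><1|
maximally_mixed = np.eye(2) / 2

print(trace_distance(proj0, proj0))             # identical states
print(trace_distance(proj0, proj1))             # states with orthogonal support
print(trace_distance(proj0, maximally_mixed))   # an intermediate case
```

Identical states give 0, states with orthogonal support give the maximum value 2, and the intermediate case gives 1.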

Exercise 9.1.1 Show that the trace distance between two qubit density operators $\rho$ and $\sigma$
is equal to the Euclidean distance between their respective Bloch vectors $\vec{r}$ and $\vec{s}$, where

$$\rho = \frac{1}{2}(I + \vec{r} \cdot \vec{\sigma}), \qquad \sigma = \frac{1}{2}(I + \vec{s} \cdot \vec{\sigma}). \tag{9.13}$$

That is, show that

$$\|\rho - \sigma\|_1 = \|\vec{r} - \vec{s}\|_2. \tag{9.14}$$

Exercise 9.1.2 Show that the trace distance obeys a telescoping property:

$$\|\rho_1 \otimes \rho_2 - \sigma_1 \otimes \sigma_2\|_1 \leq \|\rho_1 - \sigma_1\|_1 + \|\rho_2 - \sigma_2\|_1, \tag{9.15}$$

for any density operators $\rho_1$, $\rho_2$, $\sigma_1$, $\sigma_2$. (Hint: First prove that

$$\|\rho \otimes \omega - \sigma \otimes \omega\|_1 = \|\rho - \sigma\|_1, \tag{9.16}$$

for any density operators $\rho$, $\sigma$, $\omega$.)

Exercise 9.1.3 Show that the trace distance is invariant under an isometric operation $U$:

$$\|\rho - \sigma\|_1 = \|U \rho U^\dagger - U \sigma U^\dagger\|_1. \tag{9.17}$$

The physical implication of (9.17) is that an isometry applied to both states does not increase
or decrease the distinguishability of the two states.

9.1.3 Trace Distance as a Probability Difference 

We now state and prove an important lemma that gives an alternative and useful way for 
characterizing the trace distance. This particular characterization finds application in many 
proofs of the lemmas that follow concerning trace distance. 

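Lemma 9.1.1. The trace distance $\|\rho - \sigma\|_1$ between quantum states $\rho$ and $\sigma$ is equal to twice
the largest probability difference that two states $\rho$ and $\sigma$ could give to the same measurement
outcome $\Lambda$:

$$\|\rho - \sigma\|_1 = 2 \max_{0 \leq \Lambda \leq I} \mathrm{Tr}\{\Lambda(\rho - \sigma)\}. \tag{9.18}$$

The above maximization is with respect to all positive operators $\Lambda$ with eigenvalues bounded
from above by 1.

Proof. Consider that the difference operator $\rho - \sigma$ is Hermitian and we can diagonalize it as
follows:

\begin{align}
\rho - \sigma &= U D U^\dagger \tag{9.19} \\
&= U (D^+ - D^-) U^\dagger \tag{9.20} \\
&= U D^+ U^\dagger - U D^- U^\dagger, \tag{9.21}
\end{align}

where $U$ is a unitary matrix of orthonormal eigenvectors, $D$ is a diagonal matrix of
eigenvalues, $D^+$ is a diagonal matrix whose elements are the positive elements of $D$, and $D^-$ is
a diagonal matrix whose elements are the absolute values of the negative elements of $D$. We
make the following assignments:

\begin{align}
\alpha^+ &= U D^+ U^\dagger, \tag{9.22} \\
\alpha^- &= U D^- U^\dagger, \tag{9.23}
\end{align}

so that $\rho - \sigma = \alpha^+ - \alpha^-$. Let $\Pi^+$ and $\Pi^-$ be the projectors onto the respective eigenspaces
of $\alpha^+$ and $\alpha^-$. The projectors obey the orthogonality property $\Pi^+ \Pi^- = 0$ because the
eigenvectors of $\rho - \sigma$ are orthonormal. So the following properties hold:

\begin{align}
\Pi^+ (\rho - \sigma) \Pi^+ &= \Pi^+ (\alpha^+ - \alpha^-) \Pi^+ = \Pi^+ \alpha^+ \Pi^+ = \alpha^+, \tag{9.24} \\
\Pi^- (\rho - \sigma) \Pi^- &= \Pi^- (\alpha^+ - \alpha^-) \Pi^- = -\Pi^- \alpha^- \Pi^- = -\alpha^-. \tag{9.25}
\end{align}

The following property holds as well

$$|\rho - \sigma| = |\alpha^+ - \alpha^-| = \alpha^+ + \alpha^-, \tag{9.26}$$

because the supports of $\alpha^+$ and $\alpha^-$ are orthogonal and the absolute value of the operator
$\alpha^+ - \alpha^-$ takes the absolute value of its eigenvalues. Therefore,

\begin{align}
\|\rho - \sigma\|_1 &= \mathrm{Tr}\{|\rho - \sigma|\} \tag{9.27} \\
&= \mathrm{Tr}\{\alpha^+ + \alpha^-\} \tag{9.28} \\
&= \mathrm{Tr}\{\alpha^+\} + \mathrm{Tr}\{\alpha^-\}. \tag{9.29}
\end{align}

But

\begin{align}
\mathrm{Tr}\{\alpha^+\} - \mathrm{Tr}\{\alpha^-\} &= \mathrm{Tr}\{\alpha^+ - \alpha^-\} \tag{9.30} \\
&= \mathrm{Tr}\{\rho - \sigma\} \tag{9.31} \\
&= \mathrm{Tr}\{\rho\} - \mathrm{Tr}\{\sigma\} \tag{9.32} \\
&= 0, \tag{9.33}
\end{align}

where the last equality follows because both quantum states have unit trace. Therefore,
$\mathrm{Tr}\{\alpha^+\} = \mathrm{Tr}\{\alpha^-\}$ and

$$\|\rho - \sigma\|_1 = 2\,\mathrm{Tr}\{\alpha^+\}. \tag{9.34}$$

Consider then that

\begin{align}
2\,\mathrm{Tr}\{\Pi^+ (\rho - \sigma)\} &= 2\,\mathrm{Tr}\{\Pi^+ (\alpha^+ - \alpha^-)\} \tag{9.35} \\
&= 2\,\mathrm{Tr}\{\Pi^+ \alpha^+\} \tag{9.36} \\
&= 2\,\mathrm{Tr}\{\alpha^+\} \tag{9.37} \\
&= \|\rho - \sigma\|_1. \tag{9.38}
\end{align}

Now we prove that the operator $\Pi^+$ is the maximizing one. Let $\Lambda$ be any positive operator
with spectrum bounded above by unity. Then

\begin{align}
2\,\mathrm{Tr}\{\Lambda(\rho - \sigma)\} &= 2\,\mathrm{Tr}\{\Lambda(\alpha^+ - \alpha^-)\} \tag{9.39} \\
&\leq 2\,\mathrm{Tr}\{\Lambda \alpha^+\} \tag{9.40} \\
&\leq 2\,\mathrm{Tr}\{\alpha^+\} \tag{9.41} \\
&= \|\rho - \sigma\|_1. \tag{9.42}
\end{align}

The first inequality follows because $\Lambda$ and $\alpha^-$ are positive and thus $\mathrm{Tr}\{\Lambda \alpha^-\}$ is non-negative.
The second inequality holds because $\Lambda \leq I$. The final equality follows from (9.34). □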
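
The construction in the proof can be reproduced numerically. The sketch below builds the projector onto the positive eigenspace of the difference operator and compares the value it achieves with the trace norm. The random-state sampler and all function names are assumptions made for this illustration, and the check covers one random instance only.

```python
import numpy as np


def random_density_matrix(dim, rng):
    """Sample a density matrix as G G^dagger normalized to unit trace (an illustrative choice)."""
    G = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real


def positive_projector(rho, sigma):
    """Projector onto the eigenvectors of rho - sigma with positive eigenvalues."""
    eigenvalues, U = np.linalg.eigh(rho - sigma)
    columns = U[:, eigenvalues > 0]
    return columns @ columns.conj().T


rng = np.random.default_rng(0)
rho, sigma = random_density_matrix(3, rng), random_density_matrix(3, rng)

trace_norm = np.sum(np.abs(np.linalg.eigvalsh(rho - sigma)))
Pi_plus = positive_projector(rho, sigma)
achieved = 2 * np.trace(Pi_plus @ (rho - sigma)).real

# Any other operator between 0 and I should do no better than Pi_plus.
Lam = random_density_matrix(3, rng)          # eigenvalues lie in [0, 1]
other = 2 * np.trace(Lam @ (rho - sigma)).real

print(np.isclose(achieved, trace_norm), other <= trace_norm + 1e-12)
```

A density matrix is used as the competing operator only because its eigenvalues automatically lie between 0 and 1.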



Exercise 9.1.4 Show that the trace norm of any Hermitian operator $\omega$ is given by the
following optimization:

$$\|\omega\|_1 = \max_{-I \leq \Lambda \leq I} \mathrm{Tr}\{\Lambda \omega\}. \tag{9.43}$$



9.1.4 Operational Interpretation of the Trace Distance 

We now provide an operational interpretation of the trace distance as the distinguishability of
two quantum states. The interpretation results from a hypothesis testing scenario. Suppose
that Alice prepares one of two quantum states $\rho_0$ or $\rho_1$ for Bob to distinguish. Suppose
further that it is equally likely a priori for her to prepare either $\rho_0$ or $\rho_1$. Let $X$ denote the
Bernoulli random variable assigned to the prior probabilities so that $p_X(0) = p_X(1) = 1/2$.
Bob can perform a binary POVM with elements $\{\Lambda_0, \Lambda_1\}$ to distinguish the two states. That
is, Bob guesses the state in question is $\rho_0$ if he receives outcome "0" from the measurement or
he guesses the state in question is $\rho_1$ if he receives outcome "1" from the measurement. Let $Y$
denote the Bernoulli random variable assigned to the classical outcomes of his measurement.
The probability of error $p_e$ for this hypothesis testing scenario is the sum of the probability of
detecting "0" when the state is $\rho_1$ (a so-called Type II error) and the probability of detecting
"1" when the state is $\rho_0$ (a so-called Type I error):

\begin{align}
p_e &= p_{Y|X}(0|1)\, p_X(1) + p_{Y|X}(1|0)\, p_X(0) \tag{9.44} \\
&= \mathrm{Tr}\{\Lambda_0 \rho_1\} \frac{1}{2} + \mathrm{Tr}\{\Lambda_1 \rho_0\} \frac{1}{2}. \tag{9.45}
\end{align}

We can simplify this expression using the completeness relation $\Lambda_0 + \Lambda_1 = I$,

\begin{align}
p_e &= \frac{1}{2}\left(\mathrm{Tr}\{\Lambda_0 \rho_1\} + \mathrm{Tr}\{(I - \Lambda_0)\rho_0\}\right) \tag{9.46} \\
&= \frac{1}{2}\left(\mathrm{Tr}\{\Lambda_0 \rho_1\} + \mathrm{Tr}\{\rho_0\} - \mathrm{Tr}\{\Lambda_0 \rho_0\}\right) \tag{9.47} \\
&= \frac{1}{2}\left(\mathrm{Tr}\{\Lambda_0 \rho_1\} + 1 - \mathrm{Tr}\{\Lambda_0 \rho_0\}\right) \tag{9.48} \\
&= \frac{1}{4}\left(2 - 2\,\mathrm{Tr}\{\Lambda_0(\rho_0 - \rho_1)\}\right). \tag{9.49}
\end{align}

Now Bob has freedom in choosing the POVM $\{\Lambda_0, \Lambda_1\}$ to distinguish the states $\rho_0$ and $\rho_1$
and he would like to choose one that minimizes the probability of error $p_e$. Thus, we can
rewrite the error probability as follows:

$$p_e = \min_{\Lambda_0, \Lambda_1} \frac{1}{4}\left(2 - 2\,\mathrm{Tr}\{\Lambda_0(\rho_0 - \rho_1)\}\right). \tag{9.50}$$

The minimization problem becomes a maximization as a result of the negative sign:

$$p_e = \frac{1}{4}\left(2 - 2 \max_{\Lambda_0, \Lambda_1} \mathrm{Tr}\{\Lambda_0(\rho_0 - \rho_1)\}\right). \tag{9.51}$$

We can rewrite the above quantity in terms of the trace distance using its characterization in
Lemma 9.1.1 because the expression inside of the maximization involves only the operator $\Lambda_0$:

$$p_e = \frac{1}{4}\left(2 - \|\rho_0 - \rho_1\|_1\right). \tag{9.52}$$

Thus, the trace distance has the operational interpretation that it leads to the minimum
probability of error in distinguishing two quantum states $\rho_0$ and $\rho_1$ in a quantum hypothesis
testing experiment. From the above expression for probability of error, it is clear that the
states are indistinguishable when $\|\rho_0 - \rho_1\|_1$ is null. That is, it is just as good for Bob
to guess randomly what the state might be. On the other hand, the states are perfectly
distinguishable when $\|\rho_0 - \rho_1\|_1$ is maximal and the measurement that distinguishes them
consists of two projectors: one projects onto the positive eigenspace of $\rho_0 - \rho_1$ and the other
projects onto the negative eigenspace of $\rho_0 - \rho_1$.
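
The following sketch evaluates the minimum error probability for a concrete pair of qubit states and confirms that the projector onto the positive eigenspace of the difference operator attains it. The example states and the function names are assumptions chosen for illustration, and equal priors are assumed throughout.

```python
import numpy as np


def min_error_probability(rho0, rho1):
    """Minimum error probability (9.52) for equal priors."""
    trace_norm = np.sum(np.abs(np.linalg.eigvalsh(rho0 - rho1)))
    return float((2 - trace_norm) / 4)


def error_probability(Lambda0, rho0, rho1):
    """Error probability (9.45) of the POVM {Lambda0, I - Lambda0} for equal priors."""
    Lambda1 = np.eye(len(rho0)) - Lambda0
    return float(0.5 * np.trace(Lambda0 @ rho1).real + 0.5 * np.trace(Lambda1 @ rho0).real)


# Example: distinguish the state |0> from the state |+>.
rho0 = np.array([[1.0, 0.0], [0.0, 0.0]])
rho1 = np.array([[0.5, 0.5], [0.5, 0.5]])

# Lambda0 projects onto the positive eigenspace of rho0 - rho1.
eigenvalues, U = np.linalg.eigh(rho0 - rho1)
positive = U[:, eigenvalues > 0]
Lambda0 = positive @ positive.conj().T

print(min_error_probability(rho0, rho1), error_probability(Lambda0, rho0, rho1))
```

For this pair the trace distance is the square root of 2, so both printed values should agree with (2 - sqrt(2))/4, which is about 0.146.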

Exercise 9.1.5 Repeat the above derivation to show that the trace distance admits an oper- 
ational interpretation in terms of the probability of guessing correctly in quantum hypothesis 
testing. 

Exercise 9.1.6 Suppose that the prior probabilities in the above hypothesis testing scenario
are not uniform but are rather equal to $p_0$ and $p_1$. Show that the probability of error is instead
given by

$$p_e = \frac{1}{2} - \frac{1}{2} \|p_0 \rho_0 - p_1 \rho_1\|_1. \tag{9.53}$$

9.1.5 Trace Distance Lemmas



We present several useful corollaries of Lemma 9.1.1 and their corresponding proofs. These 
corollaries include the triangle inequality, measurement on approximate states, and mono- 
tonicity of trace distance. Each of these corollaries finds application in many proofs in 
quantum Shannon theory. 

Lemma 9.1.2 (Triangle Inequality). The trace distance obeys a triangle inequality. For any
three quantum states $\rho$, $\sigma$, and $\tau$, the following inequality holds

$$\|\rho - \sigma\|_1 \leq \|\rho - \tau\|_1 + \|\tau - \sigma\|_1. \tag{9.54}$$

Proof. Pick $\Pi$ as the maximizing operator for $\|\rho - \sigma\|_1$ (according to Lemma 9.1.1) so that

\begin{align}
\|\rho - \sigma\|_1 &= 2\,\mathrm{Tr}\{\Pi(\rho - \sigma)\} \tag{9.55} \\
&= 2\,\mathrm{Tr}\{\Pi(\rho - \tau)\} + 2\,\mathrm{Tr}\{\Pi(\tau - \sigma)\} \tag{9.56} \\
&\leq \|\rho - \tau\|_1 + \|\tau - \sigma\|_1. \tag{9.57}
\end{align}

The last inequality follows because the operator $\Pi$ maximizing $\|\rho - \sigma\|_1$ in general is not the
same operator that maximizes both $\|\rho - \tau\|_1$ and $\|\tau - \sigma\|_1$. □

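Corollary 9.1.1 (Measurement on Approximately Close States). Suppose we have two quantum
states $\rho$ and $\sigma$ and an operator $\Pi$ where $0 \leq \Pi \leq I$. Then

\[ \mathrm{Tr}\{\Pi\rho\} \geq \mathrm{Tr}\{\Pi\sigma\} - \|\rho - \sigma\|_1. \tag{9.58} \]

Proof. Consider the following arguments.

\begin{align}
\|\rho - \sigma\|_1 &= 2\max_{0 \leq \Lambda \leq I} \mathrm{Tr}\{\Lambda(\sigma - \rho)\} \tag{9.59}\\
&\geq \max_{0 \leq \Lambda \leq I} \mathrm{Tr}\{\Lambda(\sigma - \rho)\} \tag{9.60}\\
&\geq \mathrm{Tr}\{\Pi(\sigma - \rho)\} \tag{9.61}\\
&= \mathrm{Tr}\{\Pi\sigma\} - \mathrm{Tr}\{\Pi\rho\}. \tag{9.62}
\end{align}

The first equality follows from Lemma 9.1.1. The first inequality follows from the fact that

\[ 2\max_{0 \leq \Lambda \leq I} \mathrm{Tr}\{\Lambda(\sigma - \rho)\} \geq 0, \tag{9.63} \]

and $2 > 1$. The second inequality follows because $\Lambda$ is the maximizing operator and can
only lead to a probability difference greater than that for another operator $\Pi$ such that
$0 \leq \Pi \leq I$. □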
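
The bound (9.58) can be probed numerically. The sketch below, assuming Python with NumPy, samples random states and random operators $\Pi$ with $0 \leq \Pi \leq I$ and records the slack $\mathrm{Tr}\{\Pi\rho\} - \mathrm{Tr}\{\Pi\sigma\} + \|\rho - \sigma\|_1$, which the corollary says is never negative. The function names (`random_effect`, `corollary_gap`) are hypothetical helpers introduced for illustration.

```python
import numpy as np

def random_density_matrix(d, rng):
    """Random d x d density operator: G G^dagger normalized to unit trace."""
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

def trace_distance(rho, sigma):
    """Trace distance ||rho - sigma||_1 (no factor of 1/2)."""
    return np.abs(np.linalg.eigvalsh(rho - sigma)).sum()

def random_effect(d, rng):
    """Random operator Pi with 0 <= Pi <= I (eigenvalues drawn from [0, 1])."""
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    Q, _ = np.linalg.qr(G)                       # Q is unitary
    return (Q * rng.uniform(size=d)) @ Q.conj().T

def corollary_gap(Pi, rho, sigma):
    """Tr{Pi rho} - Tr{Pi sigma} + ||rho - sigma||_1, non-negative by (9.58)."""
    return (np.trace(Pi @ rho) - np.trace(Pi @ sigma)).real + trace_distance(rho, sigma)

rng = np.random.default_rng(1)
gaps = [
    corollary_gap(random_effect(4, rng),
                  random_density_matrix(4, rng),
                  random_density_matrix(4, rng))
    for _ in range(200)
]
print(min(gaps) >= 0)
```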



The most common way that we employ Corollary 9.1.1 in quantum Shannon theory is
in the following scenario. Suppose that a measurement with operator $\Pi$ succeeds with high
probability on a quantum state $\sigma$:

\[ \mathrm{Tr}\{\Pi\sigma\} \geq 1 - \epsilon, \tag{9.64} \]

where $\epsilon$ is some small positive number. Suppose further that another quantum state $\rho$ is
$\epsilon$-close in trace distance to $\sigma$:

\[ \|\rho - \sigma\|_1 \leq \epsilon. \tag{9.65} \]

Then Corollary 9.1.1 gives the intuitive result that the measurement succeeds with high
probability on the state $\rho$ that is close to $\sigma$:

\[ \mathrm{Tr}\{\Pi\rho\} \geq 1 - 2\epsilon, \tag{9.66} \]

by plugging (9.64) and (9.65) into (9.58).

Figure 9.2: The task in the above figure is for Bob to distinguish the state $\rho^{AB}$ from the state $\sigma^{AB}$ with
a binary-valued measurement. Bob could perform an optimal measurement on system A alone if he does
not have access to system B. If he has access to system B as well, then he can perform an optimal joint
measurement on systems A and B. We would expect that he can distinguish the states more reliably if
he performs a joint measurement because there could be more information about the state available in the
other system B. Since the trace distance is a measure of distinguishability, we would expect it to obey the
following inequality: $\|\rho^A - \sigma^A\|_1 \leq \|\rho^{AB} - \sigma^{AB}\|_1$ (the states are less distinguishable if fewer systems are
available to be part of the distinguishability test).



Exercise 9.1.7 Prove that Corollary 9.1.1 holds for arbitrary Hermitian operators $\rho$ and $\sigma$
by exploiting the result of Exercise 9.1.4.



We next turn to the monotonicity of trace distance under the discarding of a system. The
interpretation of this corollary is that discarding of a system does not increase distinguishability
of two quantum states. That is, a global measurement on the larger system might be
able to distinguish the two states better than a local measurement on an individual subsystem
could. In fact, the proof of monotonicity follows this intuition exactly, and Figure 9.2
depicts the intuition behind it.

Corollary 9.1.2 (Monotonicity). The trace distance is monotone under discarding of subsystems:

\[ \|\rho^A - \sigma^A\|_1 \leq \|\rho^{AB} - \sigma^{AB}\|_1. \tag{9.67} \]

Proof. Consider that

\[ \|\rho^A - \sigma^A\|_1 = 2\,\mathrm{Tr}\{\Lambda^A(\rho^A - \sigma^A)\}, \tag{9.68} \]

for some positive operator $\Lambda^A$ with spectrum bounded by one. Then

\begin{align}
2\,\mathrm{Tr}\{\Lambda^A(\rho^A - \sigma^A)\} &= 2\,\mathrm{Tr}\{(\Lambda^A \otimes I^B)(\rho^{AB} - \sigma^{AB})\} \tag{9.69}\\
&\leq 2\max_{0 \leq \Lambda^{AB} \leq I} \mathrm{Tr}\{\Lambda^{AB}(\rho^{AB} - \sigma^{AB})\} \tag{9.70}\\
&= \|\rho^{AB} - \sigma^{AB}\|_1. \tag{9.71}
\end{align}

The first equality follows because local predictions of the quantum theory should coincide
with its global predictions (as discussed in Exercise 4.3.8). The inequality follows because
the local operator $\Lambda^A$ never gives a higher probability difference than a maximization over
all global operators. The last equality follows from the characterization of the trace distance
in Lemma 9.1.1. □
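
To see monotonicity at work on concrete matrices, the following sketch compares the trace distance of two random bipartite states with that of their reductions to system A. It assumes Python with NumPy; `partial_trace_B` is a hypothetical helper that discards the second tensor factor, and the dimensions are arbitrary choices.

```python
import numpy as np

def random_density_matrix(d, rng):
    """Random d x d density operator: G G^dagger normalized to unit trace."""
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

def trace_distance(rho, sigma):
    """Trace distance ||rho - sigma||_1 (no factor of 1/2)."""
    return np.abs(np.linalg.eigvalsh(rho - sigma)).sum()

def partial_trace_B(rho_AB, dA, dB):
    """Trace out the second tensor factor B of an operator on A (x) B."""
    # Indices after reshape are (a, b, a', b'); summing over b = b' leaves (a, a').
    return np.trace(rho_AB.reshape(dA, dB, dA, dB), axis1=1, axis2=3)

rng = np.random.default_rng(2)
dA, dB = 2, 3
rho_AB = random_density_matrix(dA * dB, rng)
sigma_AB = random_density_matrix(dA * dB, rng)

local_distance = trace_distance(partial_trace_B(rho_AB, dA, dB),
                                partial_trace_B(sigma_AB, dA, dB))
global_distance = trace_distance(rho_AB, sigma_AB)
print(local_distance <= global_distance + 1e-12)   # the inequality (9.67)
```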

Exercise 9.1.8 (Monotonicity of Trace Distance under Noisy Maps) Show that the
trace distance is monotone under the action of any quantum channel $\mathcal{N}$:

\[ \|\mathcal{N}(\rho) - \mathcal{N}(\sigma)\|_1 \leq \|\rho - \sigma\|_1. \tag{9.72} \]

(Hint: Use the result of Corollary 9.1.2 and Exercise 9.1.3.)

The result of the previous exercise deserves an interpretation. It states that a quantum
channel $\mathcal{N}$ makes two quantum states $\rho$ and $\sigma$ less distinguishable from each other. That is,
a noisy channel tends to "blur" two states to make them appear as if they are more similar
to each other than they are before the quantum channel acts.

Exercise 9.1.9 Show that the trace distance is strongly convex. That is, for two ensembles
$\{p_{X_1}(x), \rho_x\}$ and $\{p_{X_2}(x), \sigma_x\}$ the following inequality holds

\[ \left\| \sum_x p_{X_1}(x)\rho_x - \sum_x p_{X_2}(x)\sigma_x \right\|_1 \leq \sum_x \left| p_{X_1}(x) - p_{X_2}(x) \right| + \sum_x p_{X_1}(x)\,\|\rho_x - \sigma_x\|_1. \tag{9.73} \]



9.2 Fidelity 



9.2.1 Pure-State Fidelity 

An alternate measure of the closeness of two quantum states is the fidelity. We introduce
its simplest form first. Suppose that we input a particular pure state $|\psi\rangle$ to a quantum
information processing protocol. Ideally, we may want the protocol to output the same state
that is input, but suppose that it instead outputs a pure state $|\phi\rangle$. The pure-state fidelity
$F(|\psi\rangle, |\phi\rangle)$ is a measure of how close the output state is to the input state.

Definition 9.2.1 (Pure-State Fidelity). The pure-state fidelity is the squared overlap of the
states $|\psi\rangle$ and $|\phi\rangle$:

\[ F(|\psi\rangle, |\phi\rangle) \equiv |\langle\psi|\phi\rangle|^2. \tag{9.74} \]

It has the operational interpretation as the probability that the output state $|\phi\rangle$ would pass
a test for being the same as the input state $|\psi\rangle$, conducted by someone who knows the input
state (see Exercise 9.2.2).

The pure-state fidelity is symmetric

\[ F(|\psi\rangle, |\phi\rangle) = F(|\phi\rangle, |\psi\rangle), \tag{9.75} \]

and it obeys the following bounds:

\[ 0 \leq F(|\psi\rangle, |\phi\rangle) \leq 1. \tag{9.76} \]

It is equal to one if and only if the two states are the same, and it is null if and only if the
two states are orthogonal to each other. The fidelity measure is not a distance measure in
the strict mathematical sense because it is equal to one when two states are equal, whereas
a distance measure should be null when two states are equal.

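Exercise 9.2.1 Suppose that two quantum states $|\psi\rangle$ and $|\phi\rangle$ are as follows:

\[ |\psi\rangle \equiv \sum_x \sqrt{p(x)}\,|x\rangle, \qquad |\phi\rangle \equiv \sum_x \sqrt{q(x)}\,|x\rangle, \tag{9.77} \]

where $\{|x\rangle\}$ is some orthonormal basis. Show that the fidelity $F(|\psi\rangle, |\phi\rangle)$ between these two
states is equivalent to the Bhattacharyya distance between the distributions $p(x)$ and $q(x)$:

\[ F(|\psi\rangle, |\phi\rangle) = \left[ \sum_x \sqrt{p(x)q(x)} \right]^2. \tag{9.78} \]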



9.2.2 Expected Fidelity 

Now let us suppose that the output of the protocol is not a pure state, but it is rather a
mixed state with density operator $\rho$. In general, a quantum information processing protocol
could be noisy and map the pure input state $|\psi\rangle$ to a mixed state.

Definition 9.2.2 (Expected Fidelity). The expected fidelity $F(|\psi\rangle, \rho)$ between a pure state
$|\psi\rangle$ and a mixed state $\rho$ is

\[ F(|\psi\rangle, \rho) \equiv \langle\psi|\rho|\psi\rangle. \tag{9.79} \]

We now justify the above definition of fidelity. Let us decompose $\rho$ according to its
spectral decomposition $\rho = \sum_x p_X(x)|\phi_x\rangle\langle\phi_x|$. Recall that we can think of this output
density operator as arising from the ensemble $\{p_X(x), |\phi_x\rangle\}$. We generalize the pure state
fidelity from the previous paragraph by defining it as the expected pure state fidelity, where
the expectation is with respect to states in the ensemble:

\begin{align}
F(|\psi\rangle, \rho) &\equiv \mathbb{E}_X\!\left[ |\langle\psi|\phi_X\rangle|^2 \right] \tag{9.80}\\
&= \sum_x p_X(x)\,|\langle\psi|\phi_x\rangle|^2 \tag{9.81}\\
&= \sum_x p_X(x)\,\langle\psi|\phi_x\rangle\langle\phi_x|\psi\rangle \tag{9.82}\\
&= \langle\psi| \left( \sum_x p_X(x)|\phi_x\rangle\langle\phi_x| \right) |\psi\rangle \tag{9.83}\\
&= \langle\psi|\rho|\psi\rangle. \tag{9.84}
\end{align}

The compact formula $F(|\psi\rangle, \rho) = \langle\psi|\rho|\psi\rangle$ is a good way to characterize the fidelity when
the input state is pure and the output state is mixed. We can see that the above fidelity
measure is a generalization of the pure state fidelity in (9.74). It obeys the same bounds:

\[ 0 \leq F(|\psi\rangle, \rho) \leq 1, \tag{9.85} \]

being equal to one if and only if the state $\rho$ is equivalent to $|\psi\rangle$ and equal to zero if and only
if the support of $\rho$ is orthogonal to $|\psi\rangle$.
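
The chain (9.80) through (9.84) can be checked on a concrete example. The sketch below, assuming Python with NumPy, builds the eigen-ensemble of a random density operator and confirms that the ensemble average of pure-state fidelities agrees with the compact formula $\langle\psi|\rho|\psi\rangle$. The function names are hypothetical and serve only this illustration.

```python
import numpy as np

def pure_state_fidelity(psi, phi):
    """Squared overlap |<psi|phi>|^2 from (9.74)."""
    return abs(np.vdot(psi, phi)) ** 2

def expected_fidelity(psi, rho):
    """Compact formula <psi|rho|psi> from (9.79)."""
    return np.vdot(psi, rho @ psi).real

rng = np.random.default_rng(3)

# Random mixed output state rho and random pure input state |psi>.
G = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))
rho = G @ G.conj().T
rho = rho / np.trace(rho).real
psi = rng.normal(size=3) + 1j * rng.normal(size=3)
psi = psi / np.linalg.norm(psi)

# Spectral decomposition: eigenvalues p_X(x) and eigenvectors |phi_x> (columns).
probs, vecs = np.linalg.eigh(rho)
ensemble_average = sum(p * pure_state_fidelity(psi, vecs[:, x])
                       for x, p in enumerate(probs))
compact = expected_fidelity(psi, rho)
print(np.isclose(ensemble_average, compact))
```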

Exercise 9.2.2 Given a state $\sigma$, we would like to see if it would pass a test for being close
to another state $|\varphi\rangle$. We can measure the observable $\{|\varphi\rangle\langle\varphi|, I - |\varphi\rangle\langle\varphi|\}$ with result $\varphi$
corresponding to a "pass" and the result $I - \varphi$ corresponding to a "fail." Show that the
fidelity is then Pr{"pass"}.

Exercise 9.2.3 Using the result of Corollary 9.1.1, show that the following inequality holds
for a pure state $|\varphi\rangle$ and mixed states $\rho$ and $\sigma$:

\[ F(\rho, \varphi) \leq F(\sigma, \varphi) + \|\rho - \sigma\|_1. \tag{9.86} \]

9.2.3 Uhlmann Fidelity 

What is the most general form of the fidelity when both quantum states are mixed? We can
borrow the above idea of the pure state fidelity that exploits the overlap between two pure
states. Suppose that we would like to determine the fidelity between two mixed states $\rho^A$
and $\sigma^A$ that each live on some quantum system A. Let $|\phi^\rho\rangle^{RA}$ and $|\phi^\sigma\rangle^{RA}$ denote particular
respective purifications of the mixed states to some reference system R. We can define
the Uhlmann fidelity $F(\rho^A, \sigma^A)$ between two mixed states $\rho^A$ and $\sigma^A$ as the maximum
overlap between their respective purifications, where the maximization is with respect to all
purifications $|\phi^\rho\rangle^{RA}$ and $|\phi^\sigma\rangle^{RA}$ of the respective states $\rho$ and $\sigma$:

\[ F(\rho, \sigma) \equiv \max_{|\phi^\rho\rangle^{RA},\, |\phi^\sigma\rangle^{RA}} \left| \langle\phi^\rho|\phi^\sigma\rangle^{RA} \right|^2. \tag{9.87} \]

We can express the fidelity as a maximization over unitaries instead (recall the result of
Exercise 5.1.2 that all purifications are equivalent up to unitaries on the reference system):

\begin{align}
F(\rho, \sigma) &= \max_{U_\rho, U_\sigma} \left| \langle\phi^\rho| \left( (U_\rho^\dagger)^R \otimes I^A \right) \left( U_\sigma^R \otimes I^A \right) |\phi^\sigma\rangle^{RA} \right|^2 \tag{9.88}\\
&= \max_{U_\rho, U_\sigma} \left| \langle\phi^\rho| (U_\rho^\dagger U_\sigma)^R \otimes I^A |\phi^\sigma\rangle^{RA} \right|^2. \tag{9.89}
\end{align}

It is unnecessary to maximize over two sets of unitaries because the product $U_\rho^\dagger U_\sigma$ represents
only a single unitary. The final expression for the fidelity between two mixed states is then
defined as the Uhlmann fidelity.

Definition 9.2.3 (Uhlmann Fidelity). The Uhlmann fidelity $F(\rho^A, \sigma^A)$ between two mixed
states $\rho^A$ and $\sigma^A$ is the maximum overlap between their respective purifications, where the
maximization is with respect to all unitaries $U$ on the purification system R:

\[ F(\rho, \sigma) \equiv \max_U \left| \langle\phi^\rho| U^R \otimes I^A |\phi^\sigma\rangle^{RA} \right|^2. \tag{9.90} \]

We will find that this notion of fidelity generalizes the pure-state fidelity in (9.74) and
the expected fidelity in (9.84). This holds because the following formula for the fidelity of
two mixed states, characterized in terms of the $\ell_1$-norm, is equivalent to the above Uhlmann
characterization:

\[ F(\rho, \sigma) = \left\| \sqrt{\rho}\sqrt{\sigma} \right\|_1^2. \tag{9.91} \]

We state this result as Uhlmann's theorem.



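Theorem 9.2.1 (Uhlmann's Theorem). The following two expressions for fidelity are equivalent:

\[ F(\rho, \sigma) = \max_U \left| \langle\phi^\rho| U^R \otimes I^A |\phi^\sigma\rangle^{RA} \right|^2 = \left\| \sqrt{\rho}\sqrt{\sigma} \right\|_1^2. \tag{9.92} \]

Proof. We can obtain the state $\rho$ by partial tracing over the system R in the following state:

\[ |\phi^\rho\rangle^{RA} \equiv \sqrt{d}\left( I^R \otimes \sqrt{\rho}^A \right) |\Phi\rangle^{RA}, \tag{9.93} \]

where $\sqrt{\rho}^A$ is an operator acting on the A system and $|\Phi\rangle^{RA}$ is the maximally entangled
state with respect to some basis $\{|i\rangle\}$:

\[ |\Phi\rangle^{RA} \equiv \frac{1}{\sqrt{d}} \sum_{i=1}^{d} |i\rangle^R |i\rangle^A. \tag{9.94} \]

Therefore, the state $|\phi^\rho\rangle^{RA}$ is a particular purification of $\rho$. Consider another state $\sigma^A$. We
can also obtain $\sigma^A$ as a partial trace over the R system of the following state:

\[ |\phi^\sigma\rangle^{RA} \equiv \sqrt{d}\left( I^R \otimes \sqrt{\sigma}^A \right) |\Phi\rangle^{RA}, \tag{9.95} \]

and so $|\phi^\sigma\rangle$ is a purification of $\sigma$. Consider that the overlap $\left| \langle\phi^\rho| U^R \otimes I^A |\phi^\sigma\rangle \right|^2$ is as follows:

\begin{align}
\left| \langle\phi^\rho| U^R \otimes I^A |\phi^\sigma\rangle \right|^2
&= \left| d\, \langle\Phi|^{RA} \left( U^R \otimes \sqrt{\rho}^A \right) \left( I^R \otimes \sqrt{\sigma}^A \right) |\Phi\rangle^{RA} \right|^2 \tag{9.96}\\
&= \left| \sum_{i,j} \langle i|^R \langle i|^A \left( U^R \otimes (\sqrt{\rho}\sqrt{\sigma})^A \right) |j\rangle^R |j\rangle^A \right|^2 \tag{9.97}\\
&= \left| \sum_{i,j} \langle i|^R \langle i|^A \left( I^R \otimes (\sqrt{\rho}\sqrt{\sigma}\,U^T)^A \right) |j\rangle^R |j\rangle^A \right|^2 \tag{9.98}\\
&= \left| \sum_{i,j} \langle i|j\rangle^R\, \langle i|^A \left( \sqrt{\rho}\sqrt{\sigma}\,U^T \right) |j\rangle^A \right|^2 \tag{9.99}\\
&= \left| \sum_{i} \langle i|^A \left( \sqrt{\rho}\sqrt{\sigma}\,U^T \right) |i\rangle^A \right|^2 \tag{9.100}\\
&= \left| \mathrm{Tr}\{ \sqrt{\rho}\sqrt{\sigma}\,U^T \} \right|^2. \tag{9.101}
\end{align}

The first equality follows by plugging (9.93) and (9.95) into the overlap expression
$\left| \langle\phi^\rho| U^R \otimes I^A |\phi^\sigma\rangle \right|^2$. The second through fifth equalities follow by evaluation. The last equality follows by the
definition of the trace. Using the result of Theorem A.0.3 in Appendix A, it holds that

\begin{align}
\left| \langle\phi^\rho| U^R \otimes I^A |\phi^\sigma\rangle \right|^2 &= \left| \mathrm{Tr}\{ \sqrt{\rho}\sqrt{\sigma}\,U^T \} \right|^2 \tag{9.102}\\
&\leq \left( \mathrm{Tr}\{ |\sqrt{\rho}\sqrt{\sigma}| \} \right)^2 \tag{9.103}\\
&= \left\| \sqrt{\rho}\sqrt{\sigma} \right\|_1^2. \tag{9.104}
\end{align}

Choosing $U^T$ as the inverse of the unitary in the right polar decomposition of $\sqrt{\rho}\sqrt{\sigma}$ saturates
the upper bound above. This unitary $U$ is also the maximal unitary in the Uhlmann fidelity
in (9.90). □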
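
The proof is constructive, so it can be replayed numerically. The following sketch, assuming Python with NumPy, builds the purifications (9.93) and (9.95) as vectors on R (x) A, extracts a maximizing unitary from a singular value decomposition of $\sqrt{\rho}\sqrt{\sigma}$ (used here in place of the polar decomposition, which it determines), and compares the resulting overlap with $\|\sqrt{\rho}\sqrt{\sigma}\|_1^2$. All function names are hypothetical helpers.

```python
import numpy as np

def random_density_matrix(d, rng):
    """Random d x d density operator: G G^dagger normalized to unit trace."""
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

def psd_sqrt(A):
    """Square root of a positive semidefinite Hermitian matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.conj().T

def canonical_purification(rho):
    """Vector of sqrt(d)(I (x) sqrt(rho))|Phi> = sum_i |i>^R (x) sqrt(rho)|i>^A."""
    # Entry (i, a) of the transposed root is sqrt(rho)[a, i]; R is the first index.
    return psd_sqrt(rho).T.reshape(-1)

def fidelity(rho, sigma):
    """F(rho, sigma) = || sqrt(rho) sqrt(sigma) ||_1^2, as in (9.91)."""
    M = psd_sqrt(rho) @ psd_sqrt(sigma)
    return np.linalg.svd(M, compute_uv=False).sum() ** 2

def uhlmann_overlap(rho, sigma, U):
    """|<phi^rho| U^R (x) I^A |phi^sigma>|^2 for the canonical purifications."""
    d = rho.shape[0]
    U_RA = np.kron(U, np.eye(d))
    return abs(np.vdot(canonical_purification(rho),
                       U_RA @ canonical_purification(sigma))) ** 2

def optimal_unitary(rho, sigma):
    """U with U^T = V W^dagger, where sqrt(rho) sqrt(sigma) = W S V^dagger."""
    W, _, Vh = np.linalg.svd(psd_sqrt(rho) @ psd_sqrt(sigma))
    return (Vh.conj().T @ W.conj().T).T

rng = np.random.default_rng(4)
rho, sigma = random_density_matrix(3, rng), random_density_matrix(3, rng)
U_opt = optimal_unitary(rho, sigma)
F = fidelity(rho, sigma)
overlap_opt = uhlmann_overlap(rho, sigma, U_opt)
print(np.isclose(overlap_opt, F))
```

Any other unitary on R yields an overlap that is no larger, which is the content of the inequality (9.103).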



9.2.4 Properties of Fidelity 

We discuss some further properties of the fidelity that often prove useful. Some of these
properties are the counterpart of similar properties of the trace distance. From the
characterization of fidelity in (9.92), we observe that it is symmetric in its arguments:

\[ F(\rho, \sigma) = F(\sigma, \rho). \tag{9.105} \]

It obeys the following bounds:

\[ 0 \leq F(\rho, \sigma) \leq 1. \tag{9.106} \]

The lower bound applies if and only if the respective supports of the two states $\rho$ and $\sigma$ are
orthogonal. The upper bound applies if and only if the two states $\rho$ and $\sigma$ are equal to each
other.

Property 9.2.1 (Multiplicativity over tensor products) The fidelity is multiplicative
over tensor products:

\[ F(\rho_1 \otimes \rho_2, \sigma_1 \otimes \sigma_2) = F(\rho_1, \sigma_1)\, F(\rho_2, \sigma_2). \tag{9.107} \]

This result holds by employing the definition of the fidelity in (9.91).
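
A short numerical illustration of symmetry, the bounds, and multiplicativity follows. It is a sketch assuming Python with NumPy, with the fidelity evaluated directly from (9.91); the helper names are hypothetical.

```python
import numpy as np

def random_density_matrix(d, rng):
    """Random d x d density operator: G G^dagger normalized to unit trace."""
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

def psd_sqrt(A):
    """Square root of a positive semidefinite Hermitian matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.conj().T

def fidelity(rho, sigma):
    """F(rho, sigma) = || sqrt(rho) sqrt(sigma) ||_1^2, as in (9.91)."""
    M = psd_sqrt(rho) @ psd_sqrt(sigma)
    return np.linalg.svd(M, compute_uv=False).sum() ** 2

rng = np.random.default_rng(5)
rho1, sigma1 = random_density_matrix(2, rng), random_density_matrix(2, rng)
rho2, sigma2 = random_density_matrix(3, rng), random_density_matrix(3, rng)

# Both sides of (9.107).
joint = fidelity(np.kron(rho1, rho2), np.kron(sigma1, sigma2))
product = fidelity(rho1, sigma1) * fidelity(rho2, sigma2)
print(np.isclose(joint, product))
```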



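Property 9.2.2 (Joint Concavity) The fidelity is jointly concave in its input arguments:

\[ F\!\left( \sum_x p_X(x)\rho_x, \sum_x p_X(x)\sigma_x \right) \geq \sum_x p_X(x)\, F(\rho_x, \sigma_x). \tag{9.108} \]

Proof. We prove joint concavity by exploiting the result of Exercise 5.1.4. Suppose $|\phi^{\rho_x}\rangle^{RA}$
and $|\phi^{\sigma_x}\rangle^{RA}$ are respective Uhlmann purifications of $\rho_x$ and $\sigma_x$ (these are purifications that
maximize the Uhlmann fidelity). Then

\[ F\!\left( |\phi^{\rho_x}\rangle^{RA}, |\phi^{\sigma_x}\rangle^{RA} \right) = F(\rho_x, \sigma_x). \tag{9.109} \]

Choose some orthonormal basis $\{|x\rangle^X\}$. Then

\[ |\phi^\rho\rangle \equiv \sum_x \sqrt{p_X(x)}\, |\phi^{\rho_x}\rangle^{RA} |x\rangle^X, \qquad |\phi^\sigma\rangle \equiv \sum_x \sqrt{p_X(x)}\, |\phi^{\sigma_x}\rangle^{RA} |x\rangle^X \tag{9.110} \]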



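are respective purifications of $\sum_x p_X(x)\rho_x$ and $\sum_x p_X(x)\sigma_x$. The first inequality below holds
by Uhlmann's theorem:

\begin{align}
F\!\left( \sum_x p_X(x)\rho_x, \sum_x p_X(x)\sigma_x \right) &\geq \left| \langle\phi^\rho|\phi^\sigma\rangle \right|^2 \tag{9.111}\\
&\geq \sum_x p_X(x) \left| \langle\phi^{\rho_x}|\phi^{\sigma_x}\rangle \right|^2 \tag{9.112}\\
&= \sum_x p_X(x)\, F(\rho_x, \sigma_x). \tag{9.113}
\end{align}

□

Property 9.2.3 (Concavity) The fidelity is concave over one of its arguments:

\[ F(\lambda\rho_1 + (1 - \lambda)\rho_2, \sigma) \geq \lambda F(\rho_1, \sigma) + (1 - \lambda) F(\rho_2, \sigma). \tag{9.114} \]

Concavity follows from joint concavity (Property 9.2.2).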

The following monotonicity lemma is similar to the monotonicity result for trace distance
(Corollary 9.1.2) and bears a similar interpretation: quantum states become more
similar (less distinguishable) under the discarding of subsystems.
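Lemma 9.2.1 (Monotonicity). The fidelity is non-decreasing under partial trace:

\[ F(\rho^{AB}, \sigma^{AB}) \leq F(\rho^A, \sigma^A), \tag{9.115} \]

where

\[ \rho^A = \mathrm{Tr}_B\{\rho^{AB}\}, \qquad \sigma^A = \mathrm{Tr}_B\{\sigma^{AB}\}. \tag{9.116} \]

Proof. Consider a fixed purification $|\psi\rangle^{RAB}$ of $\rho^A$ and $\rho^{AB}$ and a fixed purification $|\phi\rangle^{RAB}$ of
$\sigma^A$ and $\sigma^{AB}$. By Uhlmann's theorem,

\begin{align}
F(\rho^{AB}, \sigma^{AB}) &= \max_{U^R \otimes I^{AB}} \left| \langle\psi| U^R \otimes I^{AB} |\phi\rangle^{RAB} \right|^2, \tag{9.117}\\
F(\rho^A, \sigma^A) &= \max_{U^{RB} \otimes I^A} \left| \langle\psi| U^{RB} \otimes I^A |\phi\rangle^{RAB} \right|^2. \tag{9.118}
\end{align}

The inequality in the statement of the lemma then holds because the maximization of
$\left| \langle\psi| U^{RB} \otimes I^A |\phi\rangle^{RAB} \right|^2$ for $F(\rho^A, \sigma^A)$ is inclusive of all the unitaries in the maximization of
$\left| \langle\psi| U^R \otimes I^{AB} |\phi\rangle^{RAB} \right|^2$ for $F(\rho^{AB}, \sigma^{AB})$. Thus, $F(\rho^A, \sigma^A)$ can only be larger or equal to
$F(\rho^{AB}, \sigma^{AB})$. □

Exercise 9.2.4 Show that we can express the fidelity as

\[ F(\rho, \sigma) = \left( \mathrm{Tr}\left\{ \sqrt{ \sqrt{\rho}\,\sigma\sqrt{\rho} } \right\} \right)^2, \tag{9.119} \]

using the definition in (9.91).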
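
Both the monotonicity lemma and the alternative formula (9.119) lend themselves to a quick numerical check. The sketch below assumes Python with NumPy; `fidelity` follows (9.91), `fidelity_alt` follows (9.119), and `partial_trace_B` is a hypothetical helper that discards the second tensor factor.

```python
import numpy as np

def random_density_matrix(d, rng):
    """Random d x d density operator: G G^dagger normalized to unit trace."""
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

def psd_sqrt(A):
    """Square root of a positive semidefinite Hermitian matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.conj().T

def fidelity(rho, sigma):
    """F(rho, sigma) = || sqrt(rho) sqrt(sigma) ||_1^2, as in (9.91)."""
    M = psd_sqrt(rho) @ psd_sqrt(sigma)
    return np.linalg.svd(M, compute_uv=False).sum() ** 2

def fidelity_alt(rho, sigma):
    """(Tr sqrt( sqrt(rho) sigma sqrt(rho) ))^2, the form in (9.119)."""
    s = psd_sqrt(rho)
    X = s @ sigma @ s
    X = (X + X.conj().T) / 2            # remove rounding-level asymmetry
    return np.trace(psd_sqrt(X)).real ** 2

def partial_trace_B(rho_AB, dA, dB):
    """Trace out the second tensor factor B of an operator on A (x) B."""
    return np.trace(rho_AB.reshape(dA, dB, dA, dB), axis1=1, axis2=3)

rng = np.random.default_rng(6)
dA, dB = 2, 2
rho_AB = random_density_matrix(dA * dB, rng)
sigma_AB = random_density_matrix(dA * dB, rng)

F_AB = fidelity(rho_AB, sigma_AB)
F_A = fidelity(partial_trace_B(rho_AB, dA, dB), partial_trace_B(sigma_AB, dA, dB))
print(F_AB <= F_A + 1e-12)                               # inequality (9.115)
print(np.isclose(F_AB, fidelity_alt(rho_AB, sigma_AB)))  # agreement with (9.119)
```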



Exercise 9.2.5 Show that the fidelity is invariant under an isometry $U$:

\[ F(\rho, \sigma) = F(U\rho U^\dagger, U\sigma U^\dagger). \tag{9.120} \]

Exercise 9.2.6 Show that the fidelity is monotone under a noisy quantum operation $\mathcal{N}$:

\[ F(\rho, \sigma) \leq F(\mathcal{N}(\rho), \mathcal{N}(\sigma)). \tag{9.121} \]



Exercise 9.2.7 Suppose that Alice uses a noisy quantum channel and a sequence of quantum
operations to generate the following state, shared with Bob and Eve:

\[ \frac{1}{\sqrt{M}} \sum_m |m\rangle^A |m\rangle^{B_1} |\phi_m\rangle^{B_2 E}, \tag{9.122} \]

where Alice possesses the system A, Bob possesses systems $B_1$ and $B_2$, and Eve possesses
the system E. Let $\phi_m^E$ denote the partial trace of $|\phi_m\rangle^{B_2 E}$ over Bob's system $B_2$ so that

\[ \phi_m^E \equiv \mathrm{Tr}_{B_2}\!\left\{ |\phi_m\rangle\langle\phi_m|^{B_2 E} \right\}. \tag{9.123} \]

Suppose further that $F(\phi_m^E, \theta^E) = 1$, where $\theta^E$ is some constant density operator (independent
of m) living on Eve's system E. Determine a unitary that Bob can perform on his
systems $B_1$ and $B_2$ so that he decouples Eve's system E, in the sense that the state after
the decoupling unitary is as follows:

\[ \left( \frac{1}{\sqrt{M}} \sum_m |m\rangle^A |m\rangle^{B_1} \right) \otimes |\phi_\theta\rangle^{B_2 E}, \tag{9.124} \]

where $|\phi_\theta\rangle^{B_2 E}$ is a purification of the state $\theta^E$. The result is that Alice and Bob share
maximal entanglement between the respective systems A and $B_1$ after Bob performs the
decoupling unitary. Figure 9.3 displays the protocol.

Figure 9.3: The above figure depicts the protocol relevant to Exercise 9.2.7. Alice transmits one share
of an entangled state through a noisy quantum channel with isometric extension U. Bob and Eve receive
quantum systems as the output of the isometry. Bob performs some quantum operations so that Alice, Bob,
and Eve share the state in (9.122). Exercise 9.2.7 asks you to determine a decoupling unitary that Bob can
perform to decouple his system $B_1$ from Eve.



9.3 Relationships between Trace Distance and Fidelity 

In quantum Shannon theory we are interested in showing that a given quantum information
processing protocol approximates an ideal protocol. We might do so by showing that the
quantum output of the ideal protocol, say $\rho$, is approximately close to the quantum output
of the actual protocol, say $\sigma$. For example, we may be able to show that the fidelity between
$\rho$ and $\sigma$ is high:

\[ F(\rho, \sigma) \geq 1 - \epsilon, \tag{9.125} \]

where $\epsilon$ is a small, positive real number that determines how well $\rho$ approximates $\sigma$ according
to the above fidelity criterion. Typically, in a quantum Shannon-theoretic argument, we will
take a limit to show that it is possible to make $\epsilon$ as small as we would like. As the performance
parameter $\epsilon$ becomes vanishingly small, we expect that $\rho$ and $\sigma$ are becoming approximately
equal so that they are identically equal when $\epsilon$ vanishes in some limit.

We would naturally think that the trace distance should be small if the fidelity is high
because the trace distance vanishes when the fidelity is one and vice versa (recall the
conditions for saturation of the bounds in (9.11) and (9.106)). The next theorem makes this
intuition precise by establishing several relationships between the trace distance and fidelity.

Theorem 9.3.1 (Relations between Fidelity and Trace Distance). The following bound
applies to the trace distance and the fidelity between two quantum states $\rho$ and $\sigma$:

\[ 1 - \sqrt{F(\rho, \sigma)} \leq \frac{1}{2}\|\rho - \sigma\|_1 \leq \sqrt{1 - F(\rho, \sigma)}. \tag{9.126} \]

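Proof. We first show that there is an exact relationship between fidelity and trace distance
for pure states. Let us pick two arbitrary pure states $|\psi\rangle$ and $|\phi\rangle$. We can write the state
$|\phi\rangle$ in terms of the state $|\psi\rangle$ and its orthogonal complement $|\psi^\perp\rangle$ as follows:

\[ |\phi\rangle = \cos(\theta)|\psi\rangle + e^{i\varphi}\sin(\theta)|\psi^\perp\rangle. \tag{9.127} \]

First, the fidelity between these two pure states is

\[ F(|\psi\rangle, |\phi\rangle) = |\langle\phi|\psi\rangle|^2 = \cos^2(\theta). \tag{9.128} \]

Now let us determine the trace distance. The density operator $|\phi\rangle\langle\phi|$ is as follows:

\begin{align}
|\phi\rangle\langle\phi| &= \left( \cos(\theta)|\psi\rangle + e^{i\varphi}\sin(\theta)|\psi^\perp\rangle \right) \left( \cos(\theta)\langle\psi| + e^{-i\varphi}\sin(\theta)\langle\psi^\perp| \right) \tag{9.129}\\
&= \cos^2(\theta)|\psi\rangle\langle\psi| + e^{i\varphi}\sin(\theta)\cos(\theta)|\psi^\perp\rangle\langle\psi| \nonumber\\
&\quad + e^{-i\varphi}\cos(\theta)\sin(\theta)|\psi\rangle\langle\psi^\perp| + \sin^2(\theta)|\psi^\perp\rangle\langle\psi^\perp|. \tag{9.130}
\end{align}

The matrix representation of the operator $|\psi\rangle\langle\psi| - |\phi\rangle\langle\phi|$ with respect to the basis $\{|\psi\rangle, |\psi^\perp\rangle\}$
is

\[ \begin{bmatrix} 1 - \cos^2(\theta) & -e^{-i\varphi}\sin(\theta)\cos(\theta) \\ -e^{i\varphi}\sin(\theta)\cos(\theta) & -\sin^2(\theta) \end{bmatrix}. \tag{9.131} \]

It is straightforward to show that the eigenvalues of the above matrix are $|\sin(\theta)|$ and
$-|\sin(\theta)|$ and it then follows that the trace distance between $|\psi\rangle$ and $|\phi\rangle$ is the absolute
sum of the eigenvalues:

\[ \left\| |\psi\rangle\langle\psi| - |\phi\rangle\langle\phi| \right\|_1 = 2|\sin(\theta)|. \tag{9.132} \]

Consider the following trigonometric relationship:

\[ \left( \frac{2|\sin(\theta)|}{2} \right)^2 = 1 - \cos^2(\theta). \tag{9.133} \]

It then holds that the fidelity and trace distance for pure states are related as follows:

\[ \left( \frac{1}{2} \left\| |\psi\rangle\langle\psi| - |\phi\rangle\langle\phi| \right\|_1 \right)^2 = 1 - F(|\psi\rangle, |\phi\rangle), \tag{9.134} \]

by plugging (9.128) into the RHS of (9.133) and (9.132) into the LHS of (9.133). Thus,

\[ \frac{1}{2} \left\| |\psi\rangle\langle\psi| - |\phi\rangle\langle\phi| \right\|_1 = \sqrt{1 - F(|\psi\rangle, |\phi\rangle)}. \tag{9.135} \]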



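To prove the upper bound for mixed states $\rho^A$ and $\sigma^A$, choose purifications $|\phi^\rho\rangle^{RA}$ and $|\phi^\sigma\rangle^{RA}$
of respective states $\rho^A$ and $\sigma^A$ such that

\[ F(\rho^A, \sigma^A) = \left| \langle\phi^\sigma|\phi^\rho\rangle \right|^2 = F\!\left( |\phi^\rho\rangle^{RA}, |\phi^\sigma\rangle^{RA} \right). \tag{9.136} \]

(Recall that these purifications exist by Uhlmann's theorem.) Then

\begin{align}
\frac{1}{2}\left\| \rho^A - \sigma^A \right\|_1 &\leq \frac{1}{2}\left\| |\phi^\rho\rangle\langle\phi^\rho|^{RA} - |\phi^\sigma\rangle\langle\phi^\sigma|^{RA} \right\|_1 \tag{9.137}\\
&= \sqrt{1 - F\!\left( |\phi^\rho\rangle^{RA}, |\phi^\sigma\rangle^{RA} \right)} \tag{9.138}\\
&= \sqrt{1 - F(\rho^A, \sigma^A)}, \tag{9.139}
\end{align}

where the first inequality follows by the monotonicity of the trace distance under the
discarding of systems (Corollary 9.1.2). To prove the lower bound for mixed states $\rho$ and $\sigma$, we
first state the following theorems without proof. It is possible to show that the trace distance
is the maximum Kolmogorov distance between two probability distributions resulting from
a POVM $\{\Lambda_m\}$ acting on the states $\rho$ and $\sigma$:

\[ \|\rho - \sigma\|_1 = \max_{\{\Lambda_m\}} \sum_m |p_m - q_m|, \tag{9.140} \]

where

\[ p_m \equiv \mathrm{Tr}\{\Lambda_m\rho\}, \qquad q_m \equiv \mathrm{Tr}\{\Lambda_m\sigma\}. \tag{9.141} \]

It is also possible to show that the fidelity is the minimum Bhattacharyya distance between
two probability distributions $p'_m$ and $q'_m$ resulting from a measurement $\{\Gamma_m\}$ of the states $\rho$
and $\sigma$:

\[ F(\rho, \sigma) = \min_{\{\Gamma_m\}} \left( \sum_m \sqrt{p'_m q'_m} \right)^2, \tag{9.142} \]

where

\[ p'_m \equiv \mathrm{Tr}\{\Gamma_m\rho\}, \qquad q'_m \equiv \mathrm{Tr}\{\Gamma_m\sigma\}. \tag{9.143} \]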

We return to the proof. Suppose that the POVM $\{\Gamma_m\}$ achieves the minimum Bhattacharyya
distance and results in probability distributions $p'_m$ and $q'_m$, so that

\[ F(\rho, \sigma) = \left( \sum_m \sqrt{p'_m q'_m} \right)^2. \tag{9.144} \]

Consider that

\begin{align}
\sum_m \left( \sqrt{p'_m} - \sqrt{q'_m} \right)^2 &= \sum_m \left( p'_m + q'_m - 2\sqrt{p'_m q'_m} \right) \tag{9.145}\\
&= 2 - 2\sqrt{F(\rho, \sigma)}. \tag{9.146}
\end{align}

It also follows that

\begin{align}
\sum_m \left( \sqrt{p'_m} - \sqrt{q'_m} \right)^2 &\leq \sum_m \left| \sqrt{p'_m} - \sqrt{q'_m} \right| \left| \sqrt{p'_m} + \sqrt{q'_m} \right| \tag{9.147}\\
&= \sum_m |p'_m - q'_m| \tag{9.148}\\
&\leq \sum_m |p_m - q_m| \tag{9.149}\\
&= \|\rho - \sigma\|_1. \tag{9.150}
\end{align}

The first inequality holds because $\left| \sqrt{p'_m} - \sqrt{q'_m} \right| \leq \left| \sqrt{p'_m} + \sqrt{q'_m} \right|$. The second inequality
holds because the distributions $p'_m$ and $q'_m$ minimizing the Bhattacharyya distance in general
have Kolmogorov distance less than the distributions $p_m$ and $q_m$ that maximize the
Kolmogorov distance. Thus, the following inequality results

\[ 2 - 2\sqrt{F(\rho, \sigma)} \leq \|\rho - \sigma\|_1, \tag{9.151} \]

and the lower bound in the statement of the theorem follows. □
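
The two-sided bound (9.126) and the pure-state equality (9.135) can both be observed numerically. The following sketch assumes Python with NumPy and uses hypothetical helper names; `bounds` returns the three quantities of (9.126) in order, so the theorem predicts that they are sorted.

```python
import numpy as np

def random_density_matrix(d, rng):
    """Random d x d density operator: G G^dagger normalized to unit trace."""
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

def random_pure_state(d, rng):
    """Projector |psi><psi| onto a random unit vector."""
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    v = v / np.linalg.norm(v)
    return np.outer(v, v.conj())

def psd_sqrt(A):
    """Square root of a positive semidefinite Hermitian matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.conj().T

def fidelity(rho, sigma):
    """F(rho, sigma) = || sqrt(rho) sqrt(sigma) ||_1^2, as in (9.91)."""
    M = psd_sqrt(rho) @ psd_sqrt(sigma)
    return np.linalg.svd(M, compute_uv=False).sum() ** 2

def trace_distance(rho, sigma):
    """Trace distance ||rho - sigma||_1 (no factor of 1/2)."""
    return np.abs(np.linalg.eigvalsh(rho - sigma)).sum()

def bounds(rho, sigma):
    """Return (1 - sqrt(F), (1/2)||rho - sigma||_1, sqrt(1 - F)) from (9.126)."""
    F = min(fidelity(rho, sigma), 1.0)          # guard against rounding above 1
    return 1 - np.sqrt(F), 0.5 * trace_distance(rho, sigma), np.sqrt(1 - F)

rng = np.random.default_rng(7)
lower, middle, upper = bounds(random_density_matrix(3, rng), random_density_matrix(3, rng))
print(lower <= middle <= upper)

# For pure states the upper bound holds with equality, as in (9.135).
_, middle_pure, upper_pure = bounds(random_pure_state(3, rng), random_pure_state(3, rng))
print(np.isclose(middle_pure, upper_pure))
```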



The following two corollaries are simple consequences of Theorem 9.3.1.

Corollary 9.3.1. Suppose that $\rho$ is $\epsilon$-close to $\sigma$ in trace distance:

\[ \|\rho - \sigma\|_1 \leq \epsilon. \tag{9.152} \]

Then the fidelity between $\rho$ and $\sigma$ is greater than $1 - \epsilon$:

\[ F(\rho, \sigma) \geq 1 - \epsilon. \tag{9.153} \]

Corollary 9.3.2. Suppose the fidelity between $\rho$ and $\sigma$ is greater than $1 - \epsilon$:

\[ F(\rho, \sigma) \geq 1 - \epsilon. \tag{9.154} \]

Then $\rho$ is $2\sqrt{\epsilon}$-close to $\sigma$ in trace distance:

\[ \|\rho - \sigma\|_1 \leq 2\sqrt{\epsilon}. \tag{9.155} \]

Exercise 9.3.1 Prove the following lower bound on the probability of error $p_e$ in a quantum
hypothesis test to distinguish $\rho$ from $\sigma$:

\[ p_e \geq \frac{1}{2}\left( 1 - \sqrt{1 - F(\rho, \sigma)} \right). \tag{9.156} \]

(Hint: Recall the development in Section 9.1.4.)
9.4 Gentle Measurement

The Gentle Measurement and Gentle Operator Lemmas are particular applications of
Theorem 9.3.1, and they concern the disturbance of quantum states. We generally expect in
quantum theory that certain measurements might disturb the state which we are measuring.
For example, suppose a qubit is in the state $|0\rangle$. A measurement along the X axis gives
$+1$ and $-1$ with equal probability while drastically disturbing the state to become either
$|+\rangle$ or $|-\rangle$, respectively. On the other hand, we might expect that the measurement does
not disturb the state by very much if one outcome is highly likely. For example, suppose
that we instead measure the qubit along the Z axis. The measurement returns $+1$ with
unit probability while causing no disturbance to the qubit. The below "Gentle Measurement
Lemma" quantitatively addresses the disturbance of quantum states by demonstrating that
a measurement with one outcome that is highly likely causes only a little disturbance to the
quantum state that we measure (hence, the measurement is "gentle" or "tender").

Lemma 9.4.1 (Gentle Measurement). Consider a density operator $\rho$ and a measurement
operator $\Lambda$ where $0 \leq \Lambda \leq I$. The measurement operator could be an element of a POVM.
Suppose that the measurement operator $\Lambda$ has a high probability of detecting state $\rho$:

\[ \mathrm{Tr}\{\Lambda\rho\} \geq 1 - \epsilon, \tag{9.157} \]

where $1 \geq \epsilon > 0$ (the probability of detection is high only if $\epsilon$ is close to zero). Then the
post-measurement state

\[ \rho' \equiv \frac{\sqrt{\Lambda}\,\rho\,\sqrt{\Lambda}}{\mathrm{Tr}\{\Lambda\rho\}} \tag{9.158} \]

is $2\sqrt{\epsilon}$-close to the original state $\rho$ in trace distance:

\[ \|\rho - \rho'\|_1 \leq 2\sqrt{\epsilon}. \tag{9.159} \]

Thus, the measurement does not disturb the state $\rho$ by much if $\epsilon$ is small.
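Proof. Suppose first that $\rho$ is a pure state $|\psi\rangle\langle\psi|$. The post-measurement state is then

\[ \frac{\sqrt{\Lambda}|\psi\rangle\langle\psi|\sqrt{\Lambda}}{\langle\psi|\Lambda|\psi\rangle}. \tag{9.160} \]

The fidelity between the original state $|\psi\rangle$ and the post-measurement state above is as follows:

\begin{align}
\langle\psi| \left( \frac{\sqrt{\Lambda}|\psi\rangle\langle\psi|\sqrt{\Lambda}}{\langle\psi|\Lambda|\psi\rangle} \right) |\psi\rangle &= \frac{\left| \langle\psi|\sqrt{\Lambda}|\psi\rangle \right|^2}{\langle\psi|\Lambda|\psi\rangle} \tag{9.161}\\
&\geq \frac{\left| \langle\psi|\Lambda|\psi\rangle \right|^2}{\langle\psi|\Lambda|\psi\rangle} \tag{9.162}\\
&= \langle\psi|\Lambda|\psi\rangle \tag{9.163}\\
&\geq 1 - \epsilon. \tag{9.164}
\end{align}

The first inequality follows because $\sqrt{\Lambda} \geq \Lambda$ when $\Lambda \leq I$. The second inequality follows
from the hypothesis of the lemma. Now let us consider when we have mixed states $\rho^A$ and
$\rho'^A$. Suppose $|\psi\rangle^{RA}$ and $|\psi'\rangle^{RA}$ are respective purifications of $\rho^A$ and $\rho'^A$, where

\[ |\psi'\rangle^{RA} \equiv \frac{\left( I^R \otimes \sqrt{\Lambda}^A \right) |\psi\rangle^{RA}}{\sqrt{\langle\psi| I^R \otimes \Lambda^A |\psi\rangle^{RA}}}. \tag{9.165} \]

Then we can apply monotonicity of fidelity (Lemma 9.2.1) and the above result for pure
states to show that

\[ F(\rho^A, \rho'^A) \geq F\!\left( |\psi\rangle^{RA}, |\psi'\rangle^{RA} \right) \geq 1 - \epsilon. \tag{9.166} \]

We finally obtain the bound on the trace distance $\|\rho^A - \rho'^A\|_1$ by exploiting Corollary 9.3.2. □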
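
The lemma can be illustrated with a measurement operator that is close to the identity, so that it detects every state with high probability. The sketch below assumes Python with NumPy; the construction of $\Lambda$ as a mixture of the identity and a random operator, the mixing weight `t`, and all function names are illustrative assumptions rather than part of the lemma.

```python
import numpy as np

def random_density_matrix(d, rng):
    """Random d x d density operator: G G^dagger normalized to unit trace."""
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

def random_effect(d, rng):
    """Random operator E with 0 <= E <= I (eigenvalues drawn from [0, 1])."""
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d)))
    return (Q * rng.uniform(size=d)) @ Q.conj().T

def psd_sqrt(A):
    """Square root of a positive semidefinite Hermitian matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.conj().T

def trace_distance(rho, sigma):
    """Trace distance ||rho - sigma||_1 (no factor of 1/2)."""
    return np.abs(np.linalg.eigvalsh(rho - sigma)).sum()

def gentle_measurement(rho, Lam):
    """Return (||rho - rho'||_1, 2 sqrt(eps)) with eps = 1 - Tr{Lam rho}."""
    p = np.trace(Lam @ rho).real                 # detection probability
    root = psd_sqrt(Lam)
    rho_post = root @ rho @ root / p             # post-measurement state (9.158)
    return trace_distance(rho, rho_post), 2 * np.sqrt(max(1.0 - p, 0.0))

rng = np.random.default_rng(8)
d, t = 4, 0.05
rho = random_density_matrix(d, rng)
Lam = (1 - t) * np.eye(d) + t * random_effect(d, rng)   # eigenvalues in [1 - t, 1]
disturbance, bound = gentle_measurement(rho, Lam)
print(disturbance <= bound)                              # the bound (9.159)
```

The qubit example from the start of this section is the extreme case: for $\rho = |0\rangle\langle 0|$ and $\Lambda = |0\rangle\langle 0|$ the detection probability is one and the disturbance vanishes.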



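The following lemma is a variation on the Gentle Measurement Lemma that we sometimes
exploit.

Lemma 9.4.2 (Gentle Operator). Consider a density operator $\rho$ and a measurement
operator $\Lambda$ where $0 \leq \Lambda \leq I$. The measurement operator could be an element of a POVM.
Suppose that the measurement operator $\Lambda$ has a high probability of detecting state $\rho$:

\[ \mathrm{Tr}\{\Lambda\rho\} \geq 1 - \epsilon, \tag{9.167} \]

where $1 \geq \epsilon > 0$ (the probability is high only if $\epsilon$ is close to zero). Then $\sqrt{\Lambda}\rho\sqrt{\Lambda}$ is $2\sqrt{\epsilon}$-close
to the original state $\rho$ in trace distance:

\[ \left\| \rho - \sqrt{\Lambda}\rho\sqrt{\Lambda} \right\|_1 \leq 2\sqrt{\epsilon}. \tag{9.168} \]

Proof. Consider the following chain of inequalities:

\begin{align}
\left\| \rho - \sqrt{\Lambda}\rho\sqrt{\Lambda} \right\|_1 &= \left\| \left( I - \sqrt{\Lambda} + \sqrt{\Lambda} \right)\rho - \sqrt{\Lambda}\rho\sqrt{\Lambda} \right\|_1 \tag{9.169}\\
&\leq \left\| \left( I - \sqrt{\Lambda} \right)\rho \right\|_1 + \left\| \sqrt{\Lambda}\rho\left( I - \sqrt{\Lambda} \right) \right\|_1 \tag{9.170}\\
&= \mathrm{Tr}\left| \left( I - \sqrt{\Lambda} \right)\sqrt{\rho}\cdot\sqrt{\rho} \right| + \mathrm{Tr}\left| \sqrt{\Lambda}\sqrt{\rho}\cdot\sqrt{\rho}\left( I - \sqrt{\Lambda} \right) \right| \tag{9.171}\\
&\leq \sqrt{\mathrm{Tr}\left\{ \left( I - \sqrt{\Lambda} \right)^2\rho \right\}\mathrm{Tr}\{\rho\}} + \sqrt{\mathrm{Tr}\{\Lambda\rho\}\,\mathrm{Tr}\left\{ \rho\left( I - \sqrt{\Lambda} \right)^2 \right\}} \tag{9.172}\\
&\leq \sqrt{\mathrm{Tr}\{(I - \Lambda)\rho\}} + \sqrt{\mathrm{Tr}\{\rho(I - \Lambda)\}} \tag{9.173}\\
&= 2\sqrt{\mathrm{Tr}\{(I - \Lambda)\rho\}} \tag{9.174}\\
&\leq 2\sqrt{\epsilon}. \tag{9.175}
\end{align}

The first inequality is the triangle inequality. The second equality follows from the definition
of the trace norm and the fact that $\rho$ is a positive operator. The second inequality is the
Cauchy-Schwarz inequality for the Hilbert-Schmidt norm and any two operators $A$ and $B$:

\[ \mathrm{Tr}\{A^\dagger B\} \leq \sqrt{\mathrm{Tr}\{A^\dagger A\}\,\mathrm{Tr}\{B^\dagger B\}}. \tag{9.176} \]

The third inequality follows because $(1 - \sqrt{x})^2 \leq 1 - x$ for $0 \leq x \leq 1$, $\mathrm{Tr}\{\rho\} = 1$, and
$\mathrm{Tr}\{\Lambda\rho\} \leq 1$. The final inequality follows from applying (9.167) and because the square root
function is monotone increasing. □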
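
A projector onto the dominant eigenvectors of $\rho$ gives a simple test case for the Gentle Operator Lemma, since then $\sqrt{\Lambda} = \Lambda$ and the discarded weight is exactly $\epsilon$. The sketch below assumes Python with NumPy; choosing $\Lambda$ this way, the cut-off `k`, and the helper names are illustrative assumptions. It also checks the scalar inequality $(1 - \sqrt{x})^2 \leq 1 - x$ used in the third step of the proof.

```python
import numpy as np

def random_density_matrix(d, rng):
    """Random d x d density operator: G G^dagger normalized to unit trace."""
    G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = G @ G.conj().T
    return rho / np.trace(rho).real

def psd_sqrt(A):
    """Square root of a positive semidefinite Hermitian matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.conj().T

def trace_distance(rho, sigma):
    """Trace distance ||rho - sigma||_1 (no factor of 1/2)."""
    return np.abs(np.linalg.eigvalsh(rho - sigma)).sum()

def gentle_operator(rho, Lam):
    """Return (||rho - sqrt(Lam) rho sqrt(Lam)||_1, 2 sqrt(eps))."""
    root = psd_sqrt(Lam)
    eps = 1.0 - np.trace(Lam @ rho).real
    return trace_distance(rho, root @ rho @ root), 2 * np.sqrt(max(eps, 0.0))

def top_eigenprojector(rho, k):
    """Projector onto the k eigenvectors of rho with the largest eigenvalues."""
    _, V = np.linalg.eigh(rho)           # eigenvalues in ascending order
    top = V[:, -k:]
    return top @ top.conj().T

rng = np.random.default_rng(9)
rho = random_density_matrix(5, rng)
Lam = top_eigenprojector(rho, 3)
disturbance, bound = gentle_operator(rho, Lam)
print(disturbance <= bound)                       # the bound (9.168)

# Scalar fact behind (9.173): (1 - sqrt(x))^2 <= 1 - x on [0, 1].
x = np.linspace(0.0, 1.0, 101)
print(np.all((1 - np.sqrt(x)) ** 2 <= 1 - x + 1e-15))
```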

Exercise 9.4.1 Show that the Gentle Operator Lemma holds for subnormalized positive
operators $\rho$ (operators $\rho$ such that $\mathrm{Tr}\{\rho\} \leq 1$).

Below is another variation on the Gentle Measurement Lemma that applies to ensembles 
of quantum states. 

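Lemma 9.4.3 (Gentle Measurement for Ensembles). Let $\{p_x, \rho_x\}$ be an ensemble with
average $\rho \equiv \sum_x p_x\rho_x$. Given a positive operator $\Lambda$ with $\Lambda \leq I$ and $\mathrm{Tr}\{\rho\Lambda\} \geq 1 - \epsilon$ where $\epsilon \leq 1$,
then

\[ \sum_x p_x \left\| \rho_x - \sqrt{\Lambda}\rho_x\sqrt{\Lambda} \right\|_1 \leq 2\sqrt{\epsilon}. \tag{9.177} \]

Proof. We can apply the same steps in the proof of the Gentle Operator Lemma to get the
following inequality:

\[ \left\| \rho_x - \sqrt{\Lambda}\rho_x\sqrt{\Lambda} \right\|_1^2 \leq 4\left( 1 - \mathrm{Tr}\{\rho_x\Lambda\} \right). \tag{9.178} \]

Taking the expectation over both sides produces the following inequality:

\begin{align}
\sum_x p_x \left\| \rho_x - \sqrt{\Lambda}\rho_x\sqrt{\Lambda} \right\|_1^2 &\leq 4\left( 1 - \mathrm{Tr}\{\rho\Lambda\} \right) \tag{9.179}\\
&\leq 4\epsilon. \tag{9.180}
\end{align}

Taking the square root of the above inequality gives the following one:

\[ \sqrt{ \sum_x p_x \left\| \rho_x - \sqrt{\Lambda}\rho_x\sqrt{\Lambda} \right\|_1^2 } \leq 2\sqrt{\epsilon}. \tag{9.181} \]

Concavity of the square root then implies the result:

\[ \sum_x p_x \sqrt{ \left\| \rho_x - \sqrt{\Lambda}\rho_x\sqrt{\Lambda} \right\|_1^2 } \leq 2\sqrt{\epsilon}. \tag{9.182} \]

□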
Exercise 9.4.2 (Coherent Gentle Measurement) Let $\{\rho_k^A\}$ be a collection of density
operators and $\{\Lambda_k^A\}$ be a POVM such that for all k:

\[ \mathrm{Tr}\{\Lambda_k^A\rho_k^A\} \geq 1 - \epsilon. \tag{9.183} \]

Let $|\phi_k\rangle^{RA}$ be a purification of $\rho_k^A$. Show that there exists a coherent gentle measurement
$\mathcal{D}^{A \to AK}$ in the sense of Section 5.4 such that

\[ \left\| \mathcal{D}^{A \to AK}\!\left( \phi_k^{RA} \right) - \phi_k^{RA} \otimes |k\rangle\langle k|^K \right\|_1 \leq 2\sqrt{\epsilon}. \tag{9.184} \]

(Hint: Use the result of Exercise 5.4.1.)



9.5 Fidelity of a Noisy Quantum Channel 

It is useful to have measures that determine how well a noisy quantum channel $\mathcal{N}$ preserves
quantum information. We developed static distance measures, such as the trace distance
and the fidelity, in the previous sections of this chapter. We would now like to exploit those
measures in order to define dynamic measures.

A "first guess" measure of this sort is the minimum fidelity $F_{\min}(\mathcal{N})$ where

\[ F_{\min}(\mathcal{N}) \equiv \min_{|\psi\rangle} F\!\left( |\psi\rangle, \mathcal{N}(|\psi\rangle\langle\psi|) \right). \tag{9.185} \]

This measure seems like it may be a good one because we generally do not know the state
that Alice inputs to a noisy channel before transmitting to Bob.

It may seem somewhat strange that we chose to minimize over pure states in the definition
of the minimum fidelity. Are not mixed states the most general states that occur in the
quantum theory? It turns out that joint concavity of the fidelity (Property 9.2.2) implies
that we do not have to consider mixed states for the minimum fidelity. Consider the following
sequence of inequalities:

\begin{align}
F(\rho, \mathcal{N}(\rho)) &= F\!\left( \sum_x p_X(x)|x\rangle\langle x|,\ \mathcal{N}\!\left( \sum_x p_X(x)|x\rangle\langle x| \right) \right) \tag{9.186}\\
&= F\!\left( \sum_x p_X(x)|x\rangle\langle x|,\ \sum_x p_X(x)\,\mathcal{N}(|x\rangle\langle x|) \right) \tag{9.187}\\
&\geq \sum_x p_X(x)\, F\!\left( |x\rangle\langle x|, \mathcal{N}(|x\rangle\langle x|) \right) \tag{9.188}\\
&\geq F\!\left( |x_{\min}\rangle\langle x_{\min}|, \mathcal{N}(|x_{\min}\rangle\langle x_{\min}|) \right). \tag{9.189}
\end{align}

The first equality follows by expanding the density operator $\rho$ with the spectral decomposition.
The second equality follows from linearity of the quantum operation $\mathcal{N}$. The first
inequality follows from joint concavity of the fidelity (Property 9.2.2), and the last inequality
follows because there exists some pure state $|x_{\min}\rangle$ (one of the eigenstates of $\rho$) with fidelity
never larger than the expected fidelity in the previous line.
9.5.1 Expected Fidelity

In general, the minimum fidelity is less useful than other measures of quantum information
preservation over a noisy channel. The difficulty with the minimum fidelity is that it requires
an optimization over the potentially large space of input states. Since it is somewhat difficult
to manipulate and compute in general, we introduce other ways to determine the performance
of a noisy quantum channel.

We can simplify our notion of fidelity by instead restricting the states that Alice sends
and averaging the fidelity over this set of states. That is, suppose that Alice is transmitting
states from an ensemble $\{p_X(x), \rho_x\}$ and we would like to determine how well a noisy quantum
channel $\mathcal{N}$ is preserving this source of quantum information. Sending a particular state $\rho_x$
through a noisy quantum channel $\mathcal{N}$ produces the state $\mathcal{N}(\rho_x)$. The fidelity between the
transmitted state $\rho_x$ and the received state $\mathcal{N}(\rho_x)$ is $F(\rho_x, \mathcal{N}(\rho_x))$ as defined before. We
define the expected fidelity of the ensemble as follows:

\begin{align}
\overline{F}(\mathcal{N}) &\equiv \mathbb{E}_X\!\left[ F(\rho_X, \mathcal{N}(\rho_X)) \right] \tag{9.190}\\
&= \sum_x p_X(x)\, F(\rho_x, \mathcal{N}(\rho_x)). \tag{9.191}
\end{align}

The expected fidelity indicates how well Alice is able to transmit the ensemble on average
to Bob. It again lies between zero and one, just as the usual fidelity does.

A more general form of the expected fidelity is to consider the expected performance
for any quantum state instead of restricting ourselves to an ensemble. That is, let us fix
some quantum state $|\psi\rangle$ and apply a random unitary $U$ to it, where we select the unitary
according to the Haar measure (this is the uniform distribution on unitaries). The state
$U|\psi\rangle$ represents a random quantum state and we can take the expectation over it in order
to define the following more general notion of expected fidelity:

\[ \overline{F}(\mathcal{N}) \equiv \mathbb{E}_U\!\left[ F\!\left( U|\psi\rangle, \mathcal{N}(U|\psi\rangle\langle\psi|U^\dagger) \right) \right]. \tag{9.192} \]

The above formula for the expected fidelity then becomes the following integral over the
Haar measure:

\[ \overline{F}(\mathcal{N}) = \int \langle\psi|U^\dagger\, \mathcal{N}\!\left( U|\psi\rangle\langle\psi|U^\dagger \right) U|\psi\rangle \ \mathrm{d}U. \tag{9.193} \]

9.5.2 Entanglement Fidelity 

We now consider a different measure of the ability of a noisy quantum channel to preserve
quantum information. Suppose that Alice would like to transmit a quantum state with
density operator $\rho^A$. It admits a purification $|\psi\rangle^{RA}$ to a reference system $R$.
Sending the $A$ system of $|\psi\rangle^{RA}$ through the identity channel $\mathcal{I}^{A \to B}$
gives a state $|\psi\rangle^{RB}$ where $B$ is a system that is isomorphic to $A$. Sending the $A$
system of $|\psi\rangle^{RA}$ through quantum channel $\mathcal{N}^{A \to B}$ gives the state
$\sigma^{RB} = (\mathcal{I}^R \otimes \mathcal{N}^{A \to B})(\psi^{RA})$. The entanglement
fidelity is as follows:

$$F_e(\rho, \mathcal{N}) = \langle\psi|\sigma|\psi\rangle, \tag{9.194}$$

and it is a measure of how well the noisy channel can preserve the entanglement with another
system. Figure 9.4 visually depicts the two states that the entanglement fidelity compares.

Figure 9.4: The entanglement fidelity compares the output of the ideal scenario (depicted on the left) and
the output of the noisy scenario (depicted on the right).

One of the benefits of considering the task of entanglement preservation is that it implies 
the task of quantum communication. That is, if Alice can devise a protocol that preserves 
the entanglement with another system, then this same protocol will also be able to preserve 
quantum information that she transmits. 

The following theorem gives a simple way to represent the entanglement fidelity in terms 
of the Kraus operators of a given noisy quantum channel. 

Theorem 9.5.1. Given a quantum channel $\mathcal{N}$ with Kraus operators $A_m$, the entanglement
fidelity $F_e(\rho, \mathcal{N})$ is equal to the following expression:

$$F_e(\rho, \mathcal{N}) = \sum_m \big|\mathrm{Tr}\{\rho A_m\}\big|^2. \tag{9.195}$$



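Proof. Suppose the spectral decomposition of $\rho^A$ is $\rho^A = \sum_i \lambda_i |i\rangle\langle i|^A$.
Its purification is then $|\psi\rangle^{RA} = \sum_i \sqrt{\lambda_i}\, |i\rangle^R |i\rangle^A$, where
$\{|i\rangle^R\}$ is some orthonormal basis on the reference system. The entanglement fidelity is
then as follows:

$$\langle\psi|(\mathcal{I}^R \otimes \mathcal{N}^A)(\psi^{RA})|\psi\rangle$$

$$= \sum_{i,j,k,l,m} \sqrt{\lambda_i \lambda_j \lambda_k \lambda_l}\; \langle i|^R \langle i|^A \big(I^R \otimes A_m^A\big)\big(|j\rangle\langle k|^R \otimes |j\rangle\langle k|^A\big)\big(I^R \otimes A_m^{\dagger A}\big) |l\rangle^R |l\rangle^A \tag{9.196}$$

$$= \sum_{i,j,k,l,m} \sqrt{\lambda_i \lambda_j \lambda_k \lambda_l}\; \langle i|j\rangle^R\, \langle i|^A A_m |j\rangle^A\, \langle k|l\rangle^R\, \langle k|^A A_m^\dagger |l\rangle^A \tag{9.197}$$

$$= \sum_{i,k,m} \lambda_i \lambda_k\, \langle i|^A A_m |i\rangle^A\, \langle k|^A A_m^\dagger |k\rangle^A \tag{9.198}$$

$$= \sum_{i,k,m} \lambda_i\, \mathrm{Tr}\{|i\rangle\langle i|^A A_m\}\; \lambda_k\, \mathrm{Tr}\{|k\rangle\langle k|^A A_m^\dagger\} \tag{9.199}$$

$$= \sum_m \mathrm{Tr}\{\rho A_m\}\, \mathrm{Tr}\{\rho A_m^\dagger\} \tag{9.200}$$

$$= \sum_m \big|\mathrm{Tr}\{\rho A_m\}\big|^2. \tag{9.201}$$

□

Exercise 9.5.1 Show that the entanglement fidelity is convex in the input state:

$$F_e(\lambda \rho_1 + (1-\lambda)\rho_2, \mathcal{N}) \leq \lambda F_e(\rho_1, \mathcal{N}) + (1-\lambda) F_e(\rho_2, \mathcal{N}). \tag{9.202}$$

(Hint: The result of Theorem 9.5.1 is useful here.)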
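
As a numerical sanity check of Theorem 9.5.1, the following sketch evaluates the Kraus-operator
formula for a qubit depolarizing channel. The helper names (`matmul`, `trace`,
`entanglement_fidelity`, `depolarizing_kraus`) and the parametrization
$\mathcal{N}(\rho) = (1-p)\rho + p\,I/2$ are assumptions made for illustration and are not
notation from the text.

```python
import math

def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def trace(a):
    """Sum of the diagonal entries of a square matrix."""
    return sum(a[i][i] for i in range(len(a)))

def entanglement_fidelity(rho, kraus_ops):
    """F_e(rho, N) = sum_m |Tr{rho A_m}|^2, as in Theorem 9.5.1."""
    return sum(abs(trace(matmul(rho, a))) ** 2 for a in kraus_ops)

def depolarizing_kraus(p):
    """Kraus operators for the assumed convention N(rho) = (1 - p) rho + p I/2."""
    identity = [[1, 0], [0, 1]]
    pauli_x = [[0, 1], [1, 0]]
    pauli_y = [[0, -1j], [1j, 0]]
    pauli_z = [[1, 0], [0, -1]]
    weights = [math.sqrt(1 - 3 * p / 4)] + [math.sqrt(p / 4)] * 3
    paulis = (identity, pauli_x, pauli_y, pauli_z)
    return [[[w * entry for entry in row] for row in op]
            for w, op in zip(weights, paulis)]

p = 0.2
pi_state = [[0.5, 0], [0, 0.5]]  # maximally mixed qubit state
f_e = entanglement_fidelity(pi_state, depolarizing_kraus(p))
```

Only the identity Kraus operator has a non-zero trace against the maximally mixed state, so under
this parametrization `f_e` should agree with $1 - 3p/4$ up to floating-point error.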



9.5.3 Relationship between Expected Fidelity and Entanglement Fidelity

The entanglement fidelity and the expected fidelity provide seemingly different methods for
quantifying the ability of a noisy quantum channel to preserve quantum information. Is
there any way that we can show how they are related?

It turns out that they are indeed related. First, consider that the entanglement fidelity
is a lower bound on the channel's fidelity for preserving the state $\rho$:

$$F_e(\rho, \mathcal{N}) \leq F(\rho, \mathcal{N}(\rho)). \tag{9.203}$$

The above result follows simply from the monotonicity of fidelity under partial trace
(Lemma 9.2.1). We can show that the entanglement fidelity never exceeds the expected fidelity
in (9.190) by combining convexity of entanglement fidelity (Exercise 9.5.1) and the bound
in (9.203):

$$F_e\Big(\sum_x p_X(x)\rho_x,\ \mathcal{N}\Big) \leq \sum_x p_X(x)\, F_e(\rho_x, \mathcal{N}) \tag{9.204}$$

$$\leq \sum_x p_X(x)\, F(\rho_x, \mathcal{N}(\rho_x)) \tag{9.205}$$

$$= \overline{F}(\mathcal{N}). \tag{9.206}$$



Thus, any channel that preserves entanglement with some reference system preserves the 
expected fidelity of an ensemble. In most cases, we only consider the entanglement fidelity 
as the defining measure of performance of a noisy quantum channel. 

The relationship between entanglement fidelity and expected fidelity becomes more exact
(and more beautiful) in the case where we select a random quantum state according to the
Haar measure. It is possible to show that the expected fidelity in (9.192) relates to the
entanglement fidelity as follows:

$$\overline{F}(\mathcal{N}) = \frac{d\, F_e(\pi, \mathcal{N}) + 1}{d + 1}, \tag{9.207}$$

where $d$ is the dimension of the input system and $\pi$ is the maximally mixed state with
purification to the maximally entangled state.



Exercise 9.5.2 Prove that the relation in (9.207) holds for a quantum depolarizing channel.

9.6 The Hilbert-Schmidt Distance Measure

One final distance measure that we develop is the Hilbert-Schmidt distance measure. It
is most similar to the familiar Euclidean distance measure of vectors because an $\ell_2$-norm
induces it. This distance measure does not have an appealing operational interpretation like
the trace distance or fidelity do, and so we do not employ it to compare quantum states.
Nevertheless, it can sometimes be helpful analytically to exploit this distance measure and
to relate it to the trace distance via the bound in Exercise 9.6.1 below.
Let us define the Hilbert-Schmidt norm of an operator M as follows: 

$$\|M\|_2 = \sqrt{\mathrm{Tr}\{M^\dagger M\}}. \tag{9.208}$$

It is straightforward to show that the above norm meets the three requirements of a norm:
positivity, homogeneity, and the triangle inequality. One can compute the square of this norm
simply by summing the squared magnitudes of the eigenvalues of the operator $M$:

$$\|M\|_2^2 = \sum_x |\mu_x|^2, \tag{9.209}$$

where $M$ admits the spectral decomposition $\sum_x \mu_x |x\rangle\langle x|$.

The Hilbert-Schmidt norm induces the following Hilbert-Schmidt distance measure:

$$\|M - N\|_2 = \sqrt{\mathrm{Tr}\{(M-N)^\dagger (M-N)\}}. \tag{9.210}$$



We can of course then apply this distance measure to quantum states $\rho$ and $\sigma$ simply by
plugging $\rho$ and $\sigma$ into the above formula in place of $M$ and $N$.

The Hilbert-Schmidt distance measure sometimes finds use in the proofs of coding theo- 
rems in quantum Shannon theory because it is often easier to find good bounds on it rather 
than on the trace distance. In some cases, we might be taking expectations over ensembles of 
density operators and this expectation often reduces to computing variances or covariances. 

Exercise 9.6.1 Show that the following inequality holds for any normal operator $X$:

$$\|X\|_1^2 \leq d\, \|X\|_2^2, \tag{9.211}$$

where $d$ is the dimension of the support of $X$. (Hint: use the convexity of the square
function.)
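
The inequality in (9.211) can be spot-checked numerically, which is not a proof but can help
build intuition. The sketch below assumes a normal operator described only by its eigenvalues
(here the difference of two commuting, diagonal density operators); the function names and the
example spectra are illustrative choices, not taken from the text.

```python
import math

def trace_norm(eigenvalues):
    """||X||_1 of a normal operator: the sum of the eigenvalue magnitudes."""
    return sum(abs(mu) for mu in eigenvalues)

def hilbert_schmidt_norm(eigenvalues):
    """||X||_2 of a normal operator: the root of the sum of squared magnitudes."""
    return math.sqrt(sum(abs(mu) ** 2 for mu in eigenvalues))

# Spectra of two diagonal density operators (hypothetical example values).
rho = [0.6, 0.3, 0.1]
sigma = [0.2, 0.2, 0.6]
diff = [r - s for r, s in zip(rho, sigma)]

d = sum(1 for mu in diff if mu != 0)           # dimension of the support
lhs = trace_norm(diff) ** 2                    # ||X||_1^2
rhs = d * hilbert_schmidt_norm(diff) ** 2      # d ||X||_2^2
```

For this example the left side is close to 1.0 and the right side is close to 1.26, consistent with
the bound.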

9.7 History and Further Reading 



Fuchs' thesis [96] and research paper [97] are a good starting point for learning more regarding
trace distance and fidelity. Other notable sources are the book of Nielsen and Chuang [197],
Yard's thesis [266], and Kretschmann's thesis [242]. Helstrom demonstrated the operational
interpretation of the trace distance in the context of quantum hypothesis testing [138], [139].
Uhlmann first proved his theorem in Ref. [240], and Jozsa later simplified the proof of this
theorem in Ref. [165]. Schumacher introduced the entanglement fidelity in Ref. [217], and
Barnum et al. made further observations regarding it in Ref. [15]. Nielsen provided a simple
proof of the exact relation between entanglement fidelity and expected fidelity [196].

Winter originally proved the "Gentle Measurement" Lemma in Ref. [255] and in his thesis [256].
There, he used it to obtain a variation of the direct part of the HSW coding
theorem. Later, he used it to prove achievable rates for the quantum multiple access
channel [257]. Ogawa and Nagaoka subsequently improved this bound to $2\sqrt{\epsilon}$ in
Appendix C of Ref. [199].

CHAPTER 10



Classical Information and Entropy



All physical systems register bits of information, whether it be an atom, an electrical current, 
the location of a billiard ball, or a switch. Information can be classical, quantum, or a hybrid 
of both, depending on the system. For example, an atom or an electron or a superconducting 
system can register quantum information because the quantum theory applies to each of 
these systems, but we can safely argue that the location of a billiard ball registers classical 
information only. These atoms or electrons or superconducting systems can also register 
classical bits because it is always possible for a quantum system to register classical bits. 

The term information, in the context of information theory, has a precise meaning that 
is somewhat different from our prior "every day" experience with it. Recall that the notion 
of the physical bit refers to the physical representation of a bit, and the information bit is 
a measure of how much we learn from the outcome of a random experiment. Perhaps the 
word "surprise" better captures the notion of information as it applies in the context of 
information theory. 

This chapter begins our formal study of classical information. Recall that Chapter 2
overviewed some of the major operational tasks in classical information theory. Here, our 
approach is somewhat different because our aim is to provide an intuitive understanding 
of information measures, in terms of the parties who have access to the classical systems. 
We define precise mathematical formulas that measure the amount of information encoded 
in a single physical system or in multiple physical systems. The advantage of developing 
this theory is that we can study information in its own right without having to consider the 
details of the physical system that registers it. 



We first introduce the entropy in Section 10.1 as the expected surprise of a random
variable. We extend this basic notion of entropy to develop other measures of information
in Sections 10.2-10.6 that prove useful as intuitive informational measures, but also, and
perhaps more importantly, these measures are the answers to operational tasks that one
might wish to perform with noisy resources. While introducing these quantities, we discuss
and prove several mathematical results concerning them that are important tools for
the practicing information theorist. These tools are useful both for proving results and for
increasing our understanding of the nature of information. Section 10.7 introduces information
inequalities that help us to understand the limits on our ability to process information.
Section 10.8 ends the chapter by applying the classical informational measures developed in
the forthcoming sections to the classical information that one can extract from a quantum
system.

10.1 Entropy of a Random Variable 

Consider a random variable $X$. Each realization $x$ of random variable $X$ belongs to an
alphabet $\mathcal{X}$. Let $p_X(x)$ denote the probability density function of $X$ so that
$p_X(x)$ is the probability that realization $x$ occurs. The information content $i(x)$ of a
particular realization $x$ is a measure of the surprise that one has upon learning the outcome
of a random experiment:

$$i(x) = -\log(p_X(x)). \tag{10.1}$$

The logarithm is base two and this choice implies that we measure surprise or information
in bits.



Figure 10.1 plots the information content for values in the unit interval. This measure
of surprise behaves as we would hope: it is higher for lower probability events that surprise
us, and it is lower for higher probability events that do not surprise us. Inspection of the
figure reveals that the information content is non-negative for any realization $x$ (it vanishes
only for a realization that occurs with certainty).

The information content is also additive, due to the choice of the logarithm function.
Given two independent random experiments involving random variable $X$ with respective
realizations $x_1$ and $x_2$, we have that

$$i(x_1, x_2) = -\log(p_{X,X}(x_1, x_2)) = -\log(p_X(x_1)\, p_X(x_2)) = i(x_1) + i(x_2). \tag{10.2}$$

Additivity is a property that we look for in measures of information (so much so that we
dedicate the whole of Chapter 12 to this issue for more general measures of information).

The information content is a useful measure of surprise for particular realizations of
random variable $X$, but it does not capture a general notion of the amount of surprise
that a given random variable $X$ possesses. The entropy $H(X)$ captures this general notion
of the surprise of a random variable $X$. It is the expected information content of random
variable $X$:

$$H(X) = \mathbb{E}_X\{i(X)\}. \tag{10.3}$$

At a first glance, the above definition may seem strangely self-referential because the
argument of the probability density function $p_X(x)$ is itself the random variable $X$, but this
is well-defined mathematically. Evaluating the above formula gives the following expression for
the entropy $H(X)$:

$$H(X) = -\sum_x p_X(x) \log(p_X(x)). \tag{10.4}$$

We adopt the convention that $0 \cdot \log(0) = 0$ for realizations with zero probability. The fact
that $\lim_{\epsilon \to 0} \epsilon \cdot \log(\epsilon) = 0$ intuitively justifies this latter convention.

Figure 10.1: The information content or "surprise" in (10.1) as a function of a probability $p$ ranging from
0 to 1. An event has a lower surprise if it is more likely to occur and it has a higher surprise if it is less likely
to occur.
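
The definitions in (10.1) and (10.4) translate directly into a few lines of code. The following is
a minimal sketch, assuming a finite alphabet whose probabilities are given as a Python list; the
function names `information_content` and `entropy` are illustrative choices.

```python
import math

def information_content(p):
    """Surprise i(x) = -log2 p(x), in bits, of an outcome with probability p > 0."""
    return -math.log2(p)

def entropy(probs):
    """H(X) as the expected information content.

    Zero-probability outcomes are skipped, in line with the convention
    0 * log(0) = 0.
    """
    return sum(p * information_content(p) for p in probs if p > 0)

h_example = entropy([0.5, 0.25, 0.25])   # averages surprises of 1, 2, and 2 bits
h_deterministic = entropy([1.0, 0.0])    # a certain outcome carries no surprise
```

Here `h_example` evaluates to 1.5 bits, and `h_deterministic` is zero because the only outcome
with non-zero probability has zero information content.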



The entropy admits an intuitive interpretation. Suppose that Alice generates a random
experiment in her lab that selects a realization $x$ according to the density $p_X(x)$ of random
variable $X$. Suppose further that Bob has not yet learned the outcome of the experiment.
The interpretation of the entropy $H(X)$ is that it quantifies Bob's uncertainty about $X$
before learning it: his expected information gain is $H(X)$ bits upon learning the outcome of
the random experiment. Shannon's noiseless coding theorem, described in Chapter 2, makes
this interpretation precise by proving that Alice needs to send Bob bits at a rate $H(X)$
in order for him to be able to decode a compressed message. Figure 10.2(a) depicts the
interpretation of the entropy $H(X)$, along with a similar interpretation for the conditional
entropy that we introduce in Section 10.2.



10.1.1 The Binary Entropy Function 

A special case of the entropy occurs when the random variable $X$ is a Bernoulli random
variable with probability density $p_X(0) = p$ and $p_X(1) = 1 - p$. This Bernoulli random
variable could correspond to the outcome of a random coin flip. The entropy in this case is
known as the binary entropy function:

$$H(p) = -p \log p - (1 - p) \log(1 - p). \tag{10.5}$$

It quantifies the number of bits that we learn from the outcome of the coin flip. If the coin
is unbiased ($p = 1/2$), then we learn a maximum of one bit ($H(p) = 1$). If the coin is
deterministic ($p = 0$ or $p = 1$), then we do not learn anything from the outcome ($H(p) = 0$).

Figure 10.2: (a) The entropy $H(X)$ is the uncertainty that Bob has about random variable $X$ before
learning it. (b) The conditional entropy $H(X|Y)$ is the uncertainty that Bob has about $X$ when he already
possesses $Y$.

Figure 10.3: The binary entropy function $H(p)$ displayed as a function of the parameter $p$.

Figure 10.3 displays a plot of the binary entropy function. The figure reveals that the binary
entropy function $H(p)$ is a concave function of the parameter $p$ and has its peak at $p = 1/2$.
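
The binary entropy function in (10.5) is simple enough to tabulate directly. This sketch assumes
the convention $0 \cdot \log(0) = 0$ at the endpoints; the name `binary_entropy` and the sample
points are illustrative.

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1 - p) log2 (1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Tabulate a few points: the endpoints, the peak, and a symmetric pair.
values = {p: binary_entropy(p) for p in (0.0, 0.1, 0.5, 0.9, 1.0)}
```

The table shows the features described above: one bit at $p = 1/2$, zero at the deterministic
endpoints, and (up to rounding) the same value at $p = 0.1$ and $p = 0.9$.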



10.1.2 Mathematical Properties of Entropy 

We now discuss five important mathematical properties of the entropy H(X). 

Property 10.1.1 (Positivity) The entropy $H(X)$ is non-negative for any probability
density $p_X(x)$:

$$H(X) \geq 0. \tag{10.6}$$

Proof. Positivity follows because entropy is the expected information content $i(x)$, and the
information content itself is non-negative. It is perhaps intuitive that the entropy should be
non-negative because this implies that we always learn a non-negative number of bits upon learning
random variable $X$ (if we already know beforehand what the outcome of a random experiment
will be, then we learn zero bits of information once we perform it). In a classical sense, we
can never learn a negative amount of information! □

Property 10.1.2 (Concavity) The entropy $H(X)$ is concave in the probability density
$p_X(x)$.

Proof. We justify this result with a heuristic "mixing" argument for now, and provide a
formal proof in Section 10.7.1. Consider two random variables $X_1$ and $X_2$ with two respective
probability density functions $p_{X_1}(x)$ and $p_{X_2}(x)$ whose realizations belong to the same
alphabet. Consider a Bernoulli random variable $B$ with probabilities $q$ and $1 - q$ corresponding
to its two respective realizations $b = 1$ and $b = 2$. Suppose that we first generate a realization
$b$ of random variable $B$ and then generate a realization $x$ of random variable $X_b$. Random
variable $X_B$ then denotes a mixed version of the two random variables $X_1$ and $X_2$. The
probability density of $X_B$ is $p_{X_B}(x) = q\, p_{X_1}(x) + (1 - q)\, p_{X_2}(x)$. Concavity of
entropy is the following inequality:

$$H(X_B) \geq q H(X_1) + (1 - q) H(X_2). \tag{10.7}$$

Our heuristic argument is that this mixing process leads to more uncertainty for the mixed
random variable $X_B$ than the expected uncertainty over the two individual random variables.
We can think of this result as a physical situation involving two gases. Two gases each have
their own entropy, but the entropy increases when we mix the two gases together. We later
give a more formal argument to justify concavity. □
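
The mixing inequality in (10.7) can be checked on a concrete pair of densities. The following
sketch is only a numerical illustration under assumed example values (`p1`, `p2`, and `q` are
arbitrary choices); it does not replace the formal proof.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mix(q, p1, p2):
    """Density of the mixed variable X_B: q * p1 + (1 - q) * p2."""
    return [q * a + (1 - q) * b for a, b in zip(p1, p2)]

p1 = [0.9, 0.1, 0.0]   # density of X_1 (example values)
p2 = [0.1, 0.2, 0.7]   # density of X_2 (example values)
q = 0.3                # probability of drawing from X_1

mixed_entropy = entropy(mix(q, p1, p2))                  # H(X_B)
average_entropy = q * entropy(p1) + (1 - q) * entropy(p2)
```

For these values the entropy of the mixture exceeds the averaged entropy, as concavity requires.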

Property 10.1.3 (Invariance under permutations) The entropy is invariant under
permutations of the realizations of random variable $X$.

Proof. That is, suppose that we apply some permutation $\pi$ to realizations
$x_1, x_2, \ldots, x_{|\mathcal{X}|}$ so that they respectively become
$\pi(x_1), \pi(x_2), \ldots, \pi(x_{|\mathcal{X}|})$. Then the entropy is invariant
under this shuffling because it depends only on the probabilities of the realizations, not the
values of the realizations. □

Property 10.1.4 (Minimum value) The entropy vanishes for a deterministic variable.

Proof. We would expect that the entropy of a deterministic variable should vanish, given the
interpretation of entropy as the uncertainty of a random experiment. This intuition holds
true and it is the degenerate probability density $p_X(x) = \delta_{x,x_0}$, where the realization
$x_0$ has all the probability and other realizations have vanishing probability, that gives the
minimum value of the entropy: $H(X) = 0$ when $X$ has a degenerate density. □

Sometimes, we may not have any prior information about the possible values of a variable 
in a system, and we may decide that it is most appropriate to describe them with a probability 
density function. How should we assign this probability density if we do not have any prior 
information about the values? Theorists and experimentalists often resort to a "principle of 
maximum entropy" or a "principle of maximal ignorance" — we should assign the probability 
density to be the one that maximizes the entropy. 

Property 10.1.5 (Maximum value) The maximum value of the entropy $H(X)$ for a
random variable $X$ with $d$ different realizations is $\log d$:

$$H(X) \leq \log d. \tag{10.8}$$

Proof. For any variable with a finite number of values, the probability density that
maximizes the entropy is the uniform distribution. This distribution results in an entropy
$\log d$, where $d$ is the number of values for the variable (the result of Exercise 2.1.1 is
that $\log d$ is the entropy of the uniform random variable). We can prove the above inequality
with a simple Lagrangian optimization by solving for the density $p_X(x)$ that maximizes the
entropy. Lagrangian optimization is well-suited for this task because the entropy is concave in
the probability density, and thus any local maximum will be a global maximum. The Lagrangian
$\mathcal{L}$ is as follows:

$$\mathcal{L} = H(X) + \lambda\Big(\sum_x p_X(x) - 1\Big), \tag{10.9}$$

where $H(X)$ is the quantity that we are maximizing, subject to the constraint that the
probability density $p_X(x)$ sums to unity. The partial derivative
$\partial\mathcal{L}/\partial p_X(x)$ is as follows:

$$\frac{\partial \mathcal{L}}{\partial p_X(x)} = \frac{\partial}{\partial p_X(x)}\Big(-\sum_{x'} p_X(x') \log(p_X(x')) + \lambda\Big[\sum_{x'} p_X(x') - 1\Big]\Big) \tag{10.10}$$

$$= -\log(p_X(x)) - 1 + \lambda. \tag{10.11}$$

We null the partial derivative $\partial\mathcal{L}/\partial p_X(x)$ to find the density that
maximizes $\mathcal{L}$:

$$0 = -\log(p_X(x)) - 1 + \lambda \tag{10.12}$$

$$\Rightarrow\ p_X(x) = 2^{\lambda - 1}. \tag{10.13}$$

(Strictly, with a base-two logarithm the constant $-1$ in (10.11) and (10.12) should read
$-\log e$, but the argument is unaffected because the solution still depends only on $\lambda$.)
The resulting density $p_X(x)$ is dependent only on a constant $\lambda$, implying that it must be
uniform: $p_X(x) = \frac{1}{d}$. Thus, the uniform distribution $\frac{1}{d}$ maximizes the
entropy $H(X)$ when random variable $X$ is finite. □
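
A quick experiment can make the bound in (10.8) tangible: the uniform density attains $\log d$,
and randomly drawn densities stay below it. The sampling scheme below (normalizing uniform random
weights with a fixed seed) is an arbitrary choice for illustration, and the names are
hypothetical.

```python
import math
import random

def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def random_distribution(d, rng):
    """Draw a density on d outcomes by normalizing random positive weights."""
    weights = [rng.random() for _ in range(d)]
    total = sum(weights)
    return [w / total for w in weights]

d = 4
rng = random.Random(0)  # fixed seed so that the run is reproducible
uniform_entropy = entropy([1 / d] * d)
max_sampled = max(entropy(random_distribution(d, rng)) for _ in range(1000))
```

With $d = 4$ the uniform density gives two bits, and none of the sampled densities should exceed
that value beyond floating-point error.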



10.2 Conditional Entropy 



Let us now suppose that Alice possesses random variable $X$ and Bob possesses some other
random variable $Y$. Random variables $X$ and $Y$ share correlations if they are not statistically
independent, and Bob then possesses "side information" about $X$ in the form of $Y$. Let
$i(x|y)$ denote the conditional information content:

$$i(x|y) = -\log(p_{X|Y}(x|y)). \tag{10.14}$$

The entropy $H(X|Y = y)$ of random variable $X$ conditional on a particular realization $y$ of
random variable $Y$ is the expected conditional information content, where the expectation
is with respect to $X$ given $Y = y$:

$$H(X|Y = y) = \mathbb{E}_{X|Y=y}\{i(X|y)\} \tag{10.15}$$

$$= -\sum_x p_{X|Y}(x|y) \log(p_{X|Y}(x|y)). \tag{10.16}$$

The relevant entropy that applies to the scenario where Bob possesses side information is
the conditional entropy $H(X|Y)$. It is the expected conditional information content where
the expectation is with respect to both $X$ and $Y$:

$$H(X|Y) = \mathbb{E}_{X,Y}\{i(X|Y)\} \tag{10.17}$$

$$= \sum_y p_Y(y)\, H(X|Y = y) \tag{10.18}$$

$$= -\sum_y p_Y(y) \sum_x p_{X|Y}(x|y) \log(p_{X|Y}(x|y)) \tag{10.19}$$

$$= -\sum_{x,y} p_{X,Y}(x,y) \log(p_{X|Y}(x|y)). \tag{10.20}$$

The conditional entropy $H(X|Y)$ deserves an interpretation as well. Suppose that Alice
possesses random variable $X$ and Bob possesses random variable $Y$. The conditional
entropy $H(X|Y)$ is the amount of uncertainty that Bob has about $X$ given that he already
possesses $Y$. Figure 10.2(b) depicts this interpretation.
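
The last expression in (10.20) is convenient for computation because it only needs the joint
density and the marginal of $Y$. The sketch below assumes the joint density is stored as a
dictionary keyed by `(x, y)` pairs; this representation and the three example tables are
illustrative assumptions.

```python
import math
from collections import defaultdict

def conditional_entropy(joint):
    """H(X|Y) = -sum_{x,y} p(x,y) log2 p(x|y) for a joint table {(x, y): p}."""
    p_y = defaultdict(float)
    for (_, y), p in joint.items():
        p_y[y] += p
    return -sum(p * math.log2(p / p_y[y])
                for (_, y), p in joint.items() if p > 0)

# Y is a perfect copy of a fair bit X: Bob has no remaining uncertainty.
correlated = {(0, 0): 0.5, (1, 1): 0.5}
# X and Y are independent fair bits: Y does not help at all.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
# Y agrees with X with probability 0.8: an intermediate case.
noisy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

h_correlated = conditional_entropy(correlated)
h_independent = conditional_entropy(independent)
h_noisy = conditional_entropy(noisy)
```

The three cases give zero bits, one bit, and a value strictly in between, matching the
interpretation of $H(X|Y)$ as Bob's residual uncertainty about $X$.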



The above interpretation of the conditional entropy $H(X|Y)$ immediately suggests that
it should be less than or equal to the entropy $H(X)$. That is, having access to a side
variable $Y$ should only decrease our uncertainty about another variable. We state this idea
as the following theorem and give a formal proof in Section 10.7.1.

Theorem 10.2.1 (Conditioning does not increase entropy). The entropy $H(X)$ is greater
than or equal to the conditional entropy $H(X|Y)$:

$$H(X) \geq H(X|Y). \tag{10.21}$$

Positivity of conditional entropy follows from positivity of entropy because conditional
entropy is the expectation of the entropy $H(X|Y = y)$ with respect to the density $p_Y(y)$. It
is again intuitive that conditional entropy should be non-negative. Even if we have access to some
side information $Y$, we always learn a non-negative number of bits of information upon learning
the outcome of a random experiment involving $X$. Perhaps strangely, we will see that quantum
conditional entropy can become negative, defying our intuition of information in the classical
sense given here.

10.3 Joint Entropy 

What if Bob knows neither $X$ nor $Y$? The natural entropic quantity that describes his
uncertainty is the joint entropy $H(X,Y)$. The joint entropy is merely the entropy of the
joint random variable $(X,Y)$:

$$H(X,Y) = \mathbb{E}_{X,Y}\{i(X,Y)\} \tag{10.22}$$

$$= -\sum_{x,y} p_{X,Y}(x,y) \log(p_{X,Y}(x,y)). \tag{10.23}$$

The following exercise asks you to explore the relation between joint entropy $H(X,Y)$,
conditional entropy $H(Y|X)$, and marginal entropy $H(X)$. Its proof follows by considering that
the multiplicative probability relation $p_{X,Y}(x,y) = p_{Y|X}(y|x)\, p_X(x)$ of joint probability,
conditional probability, and marginal probability becomes an additive relation under the logarithms
of the entropic definitions.

Exercise 10.3.1 Verify that H(X, Y) = H(X) + H(Y\X) = H(Y) + H(X\Y). 

Exercise 10.3.2 Extend the result of Exercise 10.3.1 to prove the following chaining rule
for entropy:

$$H(X_1, \ldots, X_n) = H(X_1) + H(X_2|X_1) + \cdots + H(X_n|X_{n-1}, \ldots, X_1). \tag{10.24}$$

Exercise 10.3.3 Prove that entropy is subadditive:

$$H(X_1, \ldots, X_n) \leq \sum_{i=1}^{n} H(X_i), \tag{10.25}$$

by exploiting Theorem 10.2.1 and the entropy chaining rule in Exercise 10.3.2.

Exercise 10.3.4 Prove that entropy is additive when the random variables $X_1, \ldots, X_n$ are
independent:

$$H(X_1, \ldots, X_n) = \sum_{i=1}^{n} H(X_i). \tag{10.26}$$

10.4 Mutual Information



We now introduce an entropic measure of the common or mutual information that two
parties possess. Suppose that Alice possesses random variable $X$ and Bob possesses random
variable $Y$. The mutual information is the marginal entropy $H(X)$ less the conditional
entropy $H(X|Y)$:

$$I(X;Y) = H(X) - H(X|Y). \tag{10.27}$$

It quantifies the dependence or correlations of the two random variables $X$ and $Y$.

The mutual information measures how much knowing one random variable reduces the
uncertainty about the other random variable. In this sense, it is the common information
between the two random variables. Bob possesses $Y$ and thus has an uncertainty $H(X|Y)$
about Alice's variable $X$. Knowledge of $Y$ gives an information gain of $I(X;Y)$ bits about
$X$ and thus reduces the overall uncertainty $H(X)$ about $X$, the uncertainty were he not to
have any side information at all about $X$.



Exercise 10.4.1 Show that the mutual information is symmetric in its inputs:

$$I(X;Y) = I(Y;X), \tag{10.28}$$

implying additionally that

$$I(X;Y) = H(Y) - H(Y|X). \tag{10.29}$$

We can also express the mutual information $I(X;Y)$ in terms of the respective joint and
marginal probability density functions $p_{X,Y}(x,y)$ and $p_X(x)$ and $p_Y(y)$:

$$I(X;Y) = \sum_{x,y} p_{X,Y}(x,y) \log\Big(\frac{p_{X,Y}(x,y)}{p_X(x)\, p_Y(y)}\Big). \tag{10.30}$$

The above expression leads to two insights regarding the mutual information $I(X;Y)$. Two
random variables $X$ and $Y$ possess zero bits of mutual information if and only if they are
statistically independent (recall that the joint density factors as $p_{X,Y}(x,y) = p_X(x)\, p_Y(y)$
when $X$ and $Y$ are independent). That is, knowledge of $Y$ does not give any information
about $X$ when the random variables are statistically independent. Also, two random
variables possess $H(X)$ bits of mutual information if they are perfectly correlated in the sense
that $Y = X$.
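
Both insights can be reproduced numerically from the expression in (10.30). The sketch below
assumes a joint density stored as a dictionary keyed by `(x, y)` pairs; the function name and
the two example tables are illustrative.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) log2[p(x,y) / (p(x) p(y))] for {(x, y): p}."""
    p_x = defaultdict(float)
    p_y = defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p
        p_y[y] += p
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

# Independent fair bits: the joint density factors, so every log term vanishes.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
# Y = X for a fair bit X: the mutual information equals H(X) = 1 bit.
copied = {(0, 0): 0.5, (1, 1): 0.5}

i_independent = mutual_information(independent)
i_copied = mutual_information(copied)
```

The independent pair yields zero bits and the perfectly correlated pair yields one bit, which is
$H(X)$ for a fair bit.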



Theorem 10.4.1 below states that the mutual information $I(X;Y)$ is non-negative for
any random variables $X$ and $Y$, and we provide a formal proof in Section 10.7.1. This result
also follows naturally from the definition of mutual information in (10.27) and "conditioning
does not increase entropy" (Theorem 10.2.1).

Theorem 10.4.1. The mutual information $I(X;Y)$ is non-negative for any random variables $X$
and $Y$:

$$I(X;Y) \geq 0. \tag{10.31}$$



10.5 Relative Entropy 



The relative entropy is another important entropic quantity that quantifies how "far" one
probability density function $p_{X_1}(x)$ is from another probability density function
$p_{X_2}(x)$. We define the relative entropy $D(p_{X_1} \| p_{X_2})$ as follows:

$$D(p_{X_1} \| p_{X_2}) = \sum_x p_{X_1}(x) \log\Big(\frac{p_{X_1}(x)}{p_{X_2}(x)}\Big). \tag{10.32}$$

According to the above definition, the relative entropy is an expected log-likelihood ratio of
the densities $p_{X_1}(x)$ and $p_{X_2}(x)$.

The above definition implies that the relative entropy is not symmetric under interchange
of the densities $p_{X_1}(x)$ and $p_{X_2}(x)$. Thus, the relative entropy is not a distance
measure in the strict mathematical sense because it is not symmetric (nor does it satisfy a
triangle inequality).

The relative entropy has an interpretation in source coding. Suppose that an information
source generates a random variable $X_1$ according to the density $p_{X_1}(x)$. Suppose further
that Alice (the compressor) mistakenly assumes that the probability density of the information
source is instead $p_{X_2}(x)$ and codes according to this density. Then the relative entropy
quantifies the inefficiency that Alice incurs when she codes according to the mistaken probability
density: Alice requires $H(X_1) + D(p_{X_1} \| p_{X_2})$ bits on average to code (whereas she would
only require $H(X_1)$ bits on average to code if she used the true density $p_{X_1}(x)$).

©2012 Mark M. Wilde. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0
Unported License.

We might also see now that the mutual information $I(X;Y)$ is equivalent to the relative
entropy $D(p_{X,Y}(x,y) \| p_X(x)\, p_Y(y))$ by comparing the definition of relative entropy in
(10.32) and the expression for the mutual information in (10.30). In this sense, the mutual
information quantifies how far the two random variables $X$ and $Y$ are from being independent
because it calculates the distance of the joint density $p_{X,Y}(x,y)$ from the product of the
marginals $p_X(x)\, p_Y(y)$.

The relative entropy $D(p_{X_1} \| p_{X_2})$ admits a pathological property. It can become infinite
if the distribution $p_{X_1}(x)$ does not have all of its support contained in the support of
$p_{X_2}(x)$ (i.e., if there is some realization $x$ for which $p_{X_1}(x) \neq 0$ but
$p_{X_2}(x) = 0$). This can be somewhat bothersome if we like this interpretation of relative
entropy as a notion of distance. In an extreme case, we would think that the distance between a
deterministic binary random variable $X_2$ where $\Pr\{X_2 = 1\} = 1$ and one with probabilities
$\Pr\{X_1 = 0\} = \epsilon$ and $\Pr\{X_1 = 1\} = 1 - \epsilon$ should be on the order of
$\epsilon$ (this is true for the Kolmogorov distance). However, the relative entropy
$D(p_{X_1} \| p_{X_2})$ in this case is infinite, in spite of our intuition that
these distributions are close. The interpretation in lossless source coding is that it would
require an infinite number of bits to code a distribution $p_{X_1}$ losslessly if Alice mistakes it
as $p_{X_2}$. Alice thinks that the symbol $X_2 = 0$ never occurs, and in fact, she thinks that the
typical set consists of just one sequence of all ones and every other sequence is atypical. But
in reality, the typical set is quite a bit larger than this, and it is only in the limit of an
infinite number of bits that we can say her compression is truly lossless.
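
The asymmetry and the infinite value discussed above are easy to observe numerically. The
following sketch assumes two densities on the same finite alphabet, given as equal-length lists;
the function name and the value $\epsilon = 0.01$ are illustrative choices.

```python
import math

def relative_entropy(p, q):
    """D(p || q) in bits; infinite when p puts weight where q has none."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0:
            continue          # convention 0 * log(0 / b) = 0
        if b == 0:
            return math.inf   # support of p is not contained in support of q
        total += a * math.log2(a / b)
    return total

eps = 0.01
p1 = [eps, 1 - eps]   # Pr{X_1 = 0} = eps, Pr{X_1 = 1} = 1 - eps
p2 = [0.0, 1.0]       # deterministic: Pr{X_2 = 1} = 1

d_forward = relative_entropy(p1, p2)   # infinite: p2 assigns zero weight to symbol 0
d_reverse = relative_entropy(p2, p1)   # finite and small
d_self = relative_entropy(p1, p1)      # zero for identical densities
```

Swapping the arguments changes the answer from infinite to roughly 0.0145 bits, which illustrates
both the lack of symmetry and the support condition.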

10.6 Conditional Mutual Information 

What is the common information between two random variables $X$ and $Y$ when we have
some side information embodied in a random variable $Z$? The entropic quantity that answers
this question is the conditional mutual information. It is simply the mutual information
conditioned on a random variable $Z$:

$$I(X;Y|Z) = H(Y|Z) - H(Y|X,Z) \tag{10.33}$$

$$= H(X|Z) - H(X|Y,Z) \tag{10.34}$$

$$= H(X|Z) + H(Y|Z) - H(X,Y|Z). \tag{10.35}$$

Theorem 10.6.1 (Strong subadditivity). The conditional mutual information $I(X;Y|Z)$ is
non-negative:

$$I(X;Y|Z) \geq 0. \tag{10.36}$$

Proof. The proof of the above theorem is a straightforward consequence of the positivity of
mutual information (Theorem 10.4.1). Consider the following equivalence:

$$I(X;Y|Z) = \sum_z p_Z(z)\, I(X;Y|Z = z), \tag{10.37}$$

where $I(X;Y|Z = z)$ is a mutual information with respect to the joint density
$p_{X,Y|Z}(x,y|z)$ and the marginal densities $p_{X|Z}(x|z)$ and $p_{Y|Z}(y|z)$. Positivity of
$I(X;Y|Z)$ then follows from positivity of $p_Z(z)$ and $I(X;Y|Z = z)$. □

The proof of the above classical version of strong subadditivity is perhaps trivial in
hindsight (it requires only a few arguments). The proof of the quantum version of strong
subadditivity is highly nontrivial on the other hand. We discuss strong subadditivity of
quantum entropy in the next chapter.

Theorem 10.6.2. The conditional mutual information vanishes if random variables $X$ and
$Y$ are conditionally independent through $Z$. That is,

$$I(X;Y|Z) = 0, \tag{10.38}$$

if $p_{X,Y|Z}(x,y|z) = p_{X|Z}(x|z)\, p_{Y|Z}(y|z)$.

Proof. We can establish the proof by expressing the conditional mutual information in a
form similar to that for the mutual information in (10.30):

$$I(X;Y|Z) = \sum_{x,y,z} p_Z(z)\, p_{X,Y|Z}(x,y|z) \log\left(\frac{p_{X,Y|Z}(x,y|z)}{p_{X|Z}(x|z)\, p_{Y|Z}(y|z)}\right). \tag{10.39}$$

The logarithm then vanishes when $p_{X,Y|Z}(x,y|z)$ factors as $p_{X|Z}(x|z)\, p_{Y|Z}(y|z)$. □
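The two theorems above can be checked numerically on small examples. The following is a minimal sketch, assuming a joint distribution stored as a dictionary keyed by $(x, y, z)$; the function names `cond_mutual_info` and `noisy_copies` and the two example distributions are illustrative choices, not part of the text.

```python
import math
from collections import defaultdict

def cond_mutual_info(p_xyz):
    """I(X;Y|Z) in bits for a joint distribution given as {(x, y, z): probability}."""
    p_z, p_xz, p_yz = defaultdict(float), defaultdict(float), defaultdict(float)
    for (x, y, z), p in p_xyz.items():
        p_z[z] += p
        p_xz[(x, z)] += p
        p_yz[(y, z)] += p
    total = 0.0
    for (x, y, z), p in p_xyz.items():
        if p > 0:
            # p(x,y|z) / (p(x|z) p(y|z)) equals p(x,y,z) p(z) / (p(x,z) p(y,z))
            total += p * math.log2(p * p_z[z] / (p_xz[(x, z)] * p_yz[(y, z)]))
    return total

def noisy_copies(flip):
    """Z is a fair bit; X and Y are independent noisy copies of Z (hypothetical example)."""
    dist = {}
    for z in (0, 1):
        for x in (0, 1):
            for y in (0, 1):
                px = flip if x != z else 1 - flip
                py = flip if y != z else 1 - flip
                dist[(x, y, z)] = 0.5 * px * py
    return dist

# X is a fair bit, Y is a perfect copy of X, and Z is an independent fair bit.
perfect_copy = {(x, x, z): 0.25 for x in (0, 1) for z in (0, 1)}

print(cond_mutual_info(noisy_copies(0.1)))  # conditionally independent: numerically zero
print(cond_mutual_info(perfect_copy))       # one bit of correlation remains given Z
```

In the first example $X$ and $Y$ are correlated, but only through $Z$, so the conditional mutual information vanishes as in Theorem 10.6.2. In the second, conditioning on $Z$ removes nothing.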



Exercise 10.6.1 The expression in (10.36) represents the most compact way to express the 
strong subadditivity of entropy. Show that the following inequalities are equivalent ways of 
representing strong subadditivity: 

\begin{align}
H(XY|Z) &\leq H(X|Z) + H(Y|Z), \tag{10.40} \\
H(XYZ) + H(Z) &\leq H(XZ) + H(YZ), \tag{10.41} \\
H(X|YZ) &\leq H(X|Z). \tag{10.42}
\end{align}

Exercise 10.6.2 Prove the following chaining rule for mutual information: 

$$I(X_1, \ldots, X_n; Y) = I(X_1;Y) + I(X_2;Y|X_1) + \cdots + I(X_n;Y|X_1 \cdots X_{n-1}). \tag{10.43}$$

10.7 Information Inequalities 

The entropic quantities introduced in the previous sections each have bounds associated with 
them. These bounds are fundamental limits on our ability to process and store information. 
We introduce three bounds in this section: the fundamental information inequality, the data 
processing inequality, and Fano's inequality. Each of these inequalities plays an important 
role in information theory, and we describe these roles in more detail in the forthcoming 
subsections. 
Figure 10.4: A plot that compares the functions $\ln x$ and $x - 1$, revealing that $\ln x \leq x - 1$ for all positive $x$.



10.7.1 The Fundamental Information Inequality 

The fundamental information inequality is the statement that the relative entropy is always
non-negative. This seemingly innocuous result has several important implications: the
maximal value of entropy, the fact that conditioning does not increase entropy (Theorem 10.2.1),
positivity of mutual information (Theorem 10.4.1), and strong subadditivity (Theorem 10.6.1)
are straightforward corollaries of it. The proof of the fundamental information inequality
follows from the application of a simple inequality: $\ln x \leq x - 1$.



Theorem 10.7.1 (Positivity of Relative Entropy). The relative entropy $D(p_{X_1} \| p_{X_2})$ is
non-negative for any probability density functions $p_{X_1}(x)$ and $p_{X_2}(x)$:

$$D(p_{X_1} \| p_{X_2}) \geq 0. \tag{10.44}$$



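Proof. The proof relies on the inequality $\ln x \leq x - 1$ that holds for all positive $x$ and
saturates for $x = 1$. Figure 10.4 plots these functions. We prove the theorem by application
of the following chain of inequalities:

\begin{align}
D(p_{X_1} \| p_{X_2}) &= \sum_x p_{X_1}(x) \log\left(\frac{p_{X_1}(x)}{p_{X_2}(x)}\right) \tag{10.45} \\
&= -\frac{1}{\ln 2} \sum_x p_{X_1}(x) \ln\left(\frac{p_{X_2}(x)}{p_{X_1}(x)}\right) \tag{10.46} \\
&\geq \frac{1}{\ln 2} \sum_x p_{X_1}(x) \left(1 - \frac{p_{X_2}(x)}{p_{X_1}(x)}\right) \tag{10.47} \\
&= \frac{1}{\ln 2} \left(\sum_x p_{X_1}(x) - \sum_x p_{X_2}(x)\right) \tag{10.48} \\
&= 0. \tag{10.49}
\end{align}

The sole inequality follows because $-\ln x \geq 1 - x$ (a simple rearrangement of $\ln x \leq x - 1$). □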
We can now quickly prove several corollaries of the above theorem. Recall in Section 10.1.2
that we proved that the entropy $H(X)$ takes the maximal value $\log d$, where $d$
is the size of the alphabet of $X$. The proof method involved Lagrange multipliers. Here, we
can prove this result simply by computing the relative entropy $D(p_X(x) \| \frac{1}{d})$, where $p_X(x)$
is the probability density of $X$ and $\frac{1}{d}$ is the uniform density, and applying the fundamental
information inequality:

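\begin{align}
0 &\leq D\left(p_X(x) \,\Big\|\, \tfrac{1}{d}\right) \tag{10.50} \\
&= \sum_x p_X(x) \log\left(\frac{p_X(x)}{1/d}\right) \tag{10.51} \\
&= -H(X) + \sum_x p_X(x) \log d \tag{10.52} \\
&= -H(X) + \log d. \tag{10.53}
\end{align}

It then follows that $H(X) \leq \log d$ by combining the first line with the last.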
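The identity $D(p_X \| \frac{1}{d}) = \log d - H(X)$ used above is easy to confirm numerically. The sketch below assumes distributions are plain Python lists and logarithms are base two; the helper names `relative_entropy` and `entropy` and the example distribution are illustrative assumptions.

```python
import math

def relative_entropy(p, q):
    """D(p||q) in bits; infinite if p puts mass on a symbol that q assigns zero probability."""
    d = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            d += pi * math.log2(pi / qi)
    return d

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]
d = len(p)
uniform = [1 / d] * d

gap = relative_entropy(p, uniform)
print(gap, math.log2(d) - entropy(p))  # the two numbers agree

# A distribution that is close to deterministic versus one that is exactly deterministic.
print(relative_entropy([0.01, 0.99], [0.0, 1.0]))  # inf
```

The last line illustrates the earlier remark that the relative entropy can be infinite even when two distributions look close.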



Positivity of mutual information (Theorem 10.4.1) follows by recalling that

$$I(X;Y) = D(p_{X,Y}(x,y) \| p_X(x)\, p_Y(y)) \tag{10.54}$$

and applying the fundamental information inequality. Conditioning does not increase entropy
(Theorem 10.2.1) follows by noting that $I(X;Y) = H(X) - H(X|Y)$ and applying
Theorem 10.4.1.

10.7.2 Data Processing Inequality 

Another important inequality in classical information theory is the data processing inequality. 
This inequality states that correlations between random variables can only decrease after we 
process one variable according to some stochastic function that depends only on that variable. 
The data processing inequality finds application in the converse proof of a coding theorem 
(the proof of the optimality of a communication rate) . 

Figure 10.5: Two slightly different depictions of the scenario in the data processing inequality. (a) The
map $\mathcal{N}_1$ processes random variable $X$ to produce some random variable $Y$, and the map $\mathcal{N}_2$ processes
the random variable $Y$ to produce the random variable $Z$. The inequality $I(X;Y) \geq I(X;Z)$ applies here
because correlations can only decrease after data processing. (b) This depiction of data processing helps us to
build intuition for data processing in the quantum world. The protocol begins with two perfectly correlated
random variables $X$ and $X'$. Perfect correlation implies that $p_{X,X'}(x,x') = p_X(x)\,\delta_{x,x'}$ and further that
$H(X) = I(X;X')$. We process random variable $X'$ with a stochastic map $\mathcal{N}_1$ to produce a random variable
$Y$, and then further process $Y$ according to the stochastic map $\mathcal{N}_2$ to produce random variable $Z$. By the
data processing inequality, the following chain of inequalities holds: $I(X;X') \geq I(X;Y) \geq I(X;Z)$.

We detail the scenario that applies for the data processing inequality. Suppose that
we initially have two random variables $X$ and $Y$. We might say that random variable
$Y$ arises from random variable $X$ by processing $X$ according to a stochastic map
$\mathcal{N}_1 = p_{Y|X}(y|x)$. That is, the two random variables arise by first picking $X$ according to the density
$p_X(x)$ and then processing $X$ according to the stochastic map $\mathcal{N}_1$. The mutual information
$I(X;Y)$ quantifies the correlations between these two random variables. Suppose then that
we process $Y$ according to some other stochastic map $\mathcal{N}_2 = p_{Z|Y}(z|y)$ to produce a random
variable $Z$ (note that the map can also be deterministic because the set of stochastic maps
subsumes the set of deterministic maps). Then the data processing inequality states that
the correlations between $X$ and $Z$ can be no greater than the correlations between $X$ and $Y$:



$$I(X;Y) \geq I(X;Z), \tag{10.55}$$

because data processing according to any map $\mathcal{N}_2$ cannot increase correlations.
Figure 10.5(a) depicts the scenario described above. Figure 10.5(b) depicts a slightly different
scenario for data processing that helps build intuition for the forthcoming notion of quantum
data processing in Section 11.9.3 of the next chapter. Theorem 10.7.2 below states the
classical data processing inequality.

The scenario described in the above paragraph contains a major assumption, and you may
have picked up on it. We assumed that the stochastic map $p_{Z|Y}(z|y)$ that produces random
variable $Z$ depends on random variable $Y$ only; it has no dependence on $X$. It then holds
that

$$p_{Z|Y,X}(z|y,x) = p_{Z|Y}(z|y). \tag{10.56}$$

This assumption is called the Markovian assumption and is the crucial assumption in the
proof of the data processing inequality. We say that the three random variables $X$, $Y$,
and $Z$ form a Markov chain and use the notation $X \to Y \to Z$ to indicate this stochastic
relationship.
Theorem 10.7.2 (Data Processing Inequality). Suppose three random variables $X$, $Y$, and
$Z$ form a Markov chain: $X \to Y \to Z$. Then the following data processing inequality applies:

$$I(X;Y) \geq I(X;Z). \tag{10.57}$$

Proof. The Markov condition $X \to Y \to Z$ implies that random variables $X$ and $Z$ are
conditionally independent through $Y$ because

\begin{align}
p_{X,Z|Y}(x,z|y) &= p_{Z|Y,X}(z|y,x)\, p_{X|Y}(x|y) \tag{10.58} \\
&= p_{Z|Y}(z|y)\, p_{X|Y}(x|y). \tag{10.59}
\end{align}

We prove the data processing inequality by manipulating the mutual information $I(X;YZ)$.
Consider the following equalities:

\begin{align}
I(X;YZ) &= I(X;Y) + I(X;Z|Y) \tag{10.60} \\
&= I(X;Y). \tag{10.61}
\end{align}

The first equality follows from the chain rule for mutual information (Exercise 10.6.2). The
second equality follows because the conditional mutual information $I(X;Z|Y)$ vanishes for a
Markov chain $X \to Y \to Z$, i.e., $X$ and $Z$ are conditionally independent through $Y$ (recall
Theorem 10.6.2). We can also expand the mutual information $I(X;YZ)$ in another way to
obtain

$$I(X;YZ) = I(X;Z) + I(X;Y|Z). \tag{10.62}$$

Then the following equality holds for a Markov chain $X \to Y \to Z$ by exploiting (10.61):

$$I(X;Y) = I(X;Z) + I(X;Y|Z). \tag{10.63}$$

The inequality in Theorem 10.7.2 follows because $I(X;Y|Z)$ is non-negative for any random
variables $X$, $Y$, and $Z$ (recall Theorem 10.6.1). □
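A small numerical check can make the data processing inequality concrete. The sketch below assumes stochastic maps are stored as row-stochastic nested lists with `channel[x][y]` equal to $p_{Y|X}(y|x)$, and uses two binary symmetric channels as hypothetical choices for the two maps; the names `mutual_info`, `compose`, and `bsc` are illustrative.

```python
import math

def mutual_info(p_x, channel):
    """I(X;Y) in bits for input distribution p_x and channel[x][y] = p(y|x)."""
    n_y = len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x))) for y in range(n_y)]
    total = 0.0
    for x, px in enumerate(p_x):
        for y in range(n_y):
            joint = px * channel[x][y]
            if joint > 0:
                total += joint * math.log2(channel[x][y] / p_y[y])
    return total

def compose(n1, n2):
    """Effective map from X to Z under the Markov assumption: p(z|x) = sum_y p(z|y) p(y|x)."""
    return [[sum(n1[x][y] * n2[y][z] for y in range(len(n2)))
             for z in range(len(n2[0]))] for x in range(len(n1))]

def bsc(flip):
    """Binary symmetric channel with the given flip probability."""
    return [[1 - flip, flip], [flip, 1 - flip]]

p_x = [0.3, 0.7]
n1, n2 = bsc(0.1), bsc(0.2)

i_xy = mutual_info(p_x, n1)
i_xz = mutual_info(p_x, compose(n1, n2))
print(i_xy, i_xz)  # the second value does not exceed the first
```

Note that `compose` is only valid because the second map acts on $Y$ alone, which is exactly the Markovian assumption in (10.56).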



Corollary 10.7.1. The following inequality holds for a Markov chain $X \to Y \to Z$:

$$I(X;Y) \geq I(X;Y|Z). \tag{10.64}$$

Proof. The proof follows by inspecting the above proof. □ 

10.7.3 Fano's Inequality 

The last classical information inequality that we consider is Fano's inequality. This inequality 
also finds application in the converse proof of a coding theorem. 

Fano's inequality applies to a general classical communication scenario. Suppose Alice
possesses some random variable $X$ that she transmits to Bob over a noisy communication
channel. Let $p_{Y|X}(y|x)$ denote the stochastic map corresponding to the noisy communication
channel. Bob receives a random variable $Y$ from the channel and processes it in some way
to produce his best estimate $\hat{X}$ of the original random variable $X$. Figure 10.6 depicts this
scenario.

Figure 10.6: The classical communication scenario relevant in Fano's inequality. Alice transmits a random
variable $X$ over a noisy channel $\mathcal{N}$, producing a random variable $Y$. Bob receives $Y$ and processes it according
to some decoding map $\mathcal{D}$ to produce his best estimate $\hat{X}$ of $X$.

The natural performance metric of this communication scenario is the probability of
error $p_e \equiv \Pr\{\hat{X} \neq X\}$: a low probability of error corresponds to good performance. On
the other hand, consider the conditional entropy $H(X|Y)$. We interpreted it before as the
uncertainty about $X$ from the perspective of someone who already knows $Y$. If the channel
is noiseless ($p_{Y|X}(y|x) = \delta_{y,x}$), then there is no uncertainty about $X$ because $Y$ is identical
to $X$:

$$H(X|Y) = 0. \tag{10.65}$$

As the channel becomes noisier, the conditional entropy $H(X|Y)$ increases away from zero.
In this sense, the conditional entropy $H(X|Y)$ quantifies the information about $X$ that is
lost in the channel noise. We then might naturally expect there to be a relationship between
the probability of error $p_e$ and the conditional entropy $H(X|Y)$: the amount of information
lost in the channel should be low if the probability of error is low. Fano's inequality provides
a quantitative bound corresponding to this idea.

Theorem 10.7.3 (Fano's Inequality). Suppose that Alice sends a random variable $X$ through
a noisy channel to produce random variable $Y$ and further processing of $Y$ gives an estimate
$\hat{X}$ of $X$. Thus, $X \to Y \to \hat{X}$ forms a Markov chain. Let $p_e \equiv \Pr\{\hat{X} \neq X\}$ denote
the probability of error. Then the following function of the error probability $p_e$ bounds the
information lost in the channel noise:

$$H(X|Y) \leq H_2(p_e) + p_e \log(|\mathcal{X}| - 1), \tag{10.66}$$

where $H_2(p_e)$ is the binary entropy function. In particular, note that

$$\lim_{p_e \to 0} H_2(p_e) + p_e \log(|\mathcal{X}| - 1) = 0. \tag{10.67}$$

Proof. Let $E$ denote an indicator random variable that indicates whether an error occurs:

$$E \equiv \begin{cases} 0 & : \ X = \hat{X} \\ 1 & : \ X \neq \hat{X} \end{cases}. \tag{10.68}$$

Consider the entropy

$$H(EX|\hat{X}) = H(X|\hat{X}) + H(E|X\hat{X}). \tag{10.69}$$

The entropy $H(E|X\hat{X})$ on the RHS vanishes because there is no uncertainty about the
indicator random variable $E$ if we know both $X$ and $\hat{X}$. Thus,

$$H(EX|\hat{X}) = H(X|\hat{X}). \tag{10.70}$$

Also, the data processing inequality applies to the Markov chain $X \to Y \to \hat{X}$:

$$I(X;Y) \geq I(X;\hat{X}), \tag{10.71}$$

and implies the following inequality:

$$H(X|\hat{X}) \geq H(X|Y). \tag{10.72}$$

Consider the following chain of inequalities: 

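\begin{align}
H(EX|\hat{X}) &= H(E|\hat{X}) + H(X|E\hat{X}) \tag{10.73} \\
&\leq H(E) + H(X|E\hat{X}) \tag{10.74} \\
&= H_2(p_e) + p_e H(X|\hat{X}, E=1) + (1 - p_e) H(X|\hat{X}, E=0) \tag{10.75} \\
&\leq H_2(p_e) + p_e \log(|\mathcal{X}| - 1). \tag{10.76}
\end{align}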






The first equality follows by expanding the entropy $H(EX|\hat{X})$. The first inequality follows
because conditioning does not increase entropy. The second equality follows by explicitly expanding
the conditional entropy $H(X|E\hat{X})$ in terms of the two possibilities of the error random
variable $E$. The last inequality follows from two facts: there is no uncertainty about $X$
when there is no error (when $E = 0$) and $\hat{X}$ is available, and the uncertainty about $X$ when
there is an error (when $E = 1$) and $\hat{X}$ is available is no greater than the uncertainty of a uniform
distribution $\frac{1}{|\mathcal{X}| - 1}$ on all of the other possibilities. Fano's inequality follows from putting
together (10.72), (10.70), and (10.76):

$$H(X|Y) \leq H(X|\hat{X}) = H(EX|\hat{X}) \leq H_2(p_e) + p_e \log(|\mathcal{X}| - 1). \tag{10.77}$$

□
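Fano's inequality can be checked numerically for a given input distribution and channel. The following is a sketch under stated assumptions: the decoder is taken to be the maximum a posteriori rule (one possible choice of the decoding map, not one prescribed by the text), channels are nested lists with `channel[x][y]` equal to $p_{Y|X}(y|x)$, and the function names are illustrative.

```python
import math

def h2(p):
    """Binary entropy function in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def fano_check(p_x, channel):
    """Return (H(X|Y), Fano bound) when Bob guesses the most likely x for each observed y."""
    n_x, n_y = len(p_x), len(channel[0])
    h_x_given_y = 0.0
    p_correct = 0.0
    for y in range(n_y):
        joint = [p_x[x] * channel[x][y] for x in range(n_x)]
        p_y = sum(joint)
        p_correct += max(joint)  # the MAP guess is right with this joint probability
        for j in joint:
            if j > 0:
                h_x_given_y -= j * math.log2(j / p_y)
    p_e = 1.0 - p_correct
    return h_x_given_y, h2(p_e) + p_e * math.log2(n_x - 1)

def symmetric_channel(n, p_err):
    """n-ary symmetric channel: correct with prob 1 - p_err, else uniform over the other symbols."""
    return [[1 - p_err if x == y else p_err / (n - 1) for y in range(n)] for x in range(n)]

lost, bound = fano_check([1 / 3] * 3, symmetric_channel(3, 0.2))
print(lost, bound)  # the bound is met with near equality for this symmetric example
```

For the symmetric channel with uniform input the two quantities coincide up to rounding, which shows that the bound cannot be improved in general.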

10.8 Classical Information and Entropy of Quantum Systems

We can always process classical information by employing a quantum system as the carrier 
of information. The inputs and the outputs to a quantum protocol can both be classical. 
For example, we can prepare a quantum state according to some random variable $X$;
the ensemble $\{p_X(x), \rho_x\}$ captures this idea. We can retrieve classical information from a
quantum state in the form of some random variable $Y$ by performing a measurement;
the POVM $\{\Lambda_y\}$ captures this notion (recall that we employ the POVM formalism from
Section 4.2.1 if we do not care about the state after the measurement). Suppose that Alice
prepares a quantum state according to the ensemble $\{p_X(x), \rho_x\}$ and Bob measures the
state according to the POVM $\{\Lambda_y\}$. Recall that the following formula gives the conditional
probability $p_{Y|X}(y|x)$:

$$p_{Y|X}(y|x) = \mathrm{Tr}\{\Lambda_y \rho_x\}. \tag{10.78}$$

Is there any benefit to processing classical information with quantum systems? Later, in
Chapter 19, we see that there indeed is an enhanced performance because we can achieve
higher communication rates in general by processing classical data with quantum resources.
For now, we extend our notions of entropy in a straightforward way to include the above
ideas.

10.8.1 Shannon Entropy of a POVM 

The first notion that we can extend is the Shannon entropy, by determining the Shannon
entropy of a POVM. Suppose that Alice prepares a quantum state $\rho$ (there is no classical
index here). Bob can then perform a particular POVM $\{\Lambda_x\}$ to learn about the quantum
system. Let $X$ denote the random variable corresponding to the classical output of the
POVM. The probability density function $p_X(x)$ of random variable $X$ is then

$$p_X(x) = \mathrm{Tr}\{\Lambda_x \rho\}. \tag{10.79}$$

The Shannon entropy $H(X)$ of the POVM $\{\Lambda_x\}$ is

\begin{align}
H(X) &= -\sum_x p_X(x) \log(p_X(x)) \tag{10.80} \\
&= -\sum_x \mathrm{Tr}\{\Lambda_x \rho\} \log(\mathrm{Tr}\{\Lambda_x \rho\}). \tag{10.81}
\end{align}

In the next chapter, we prove that the minimum Shannon entropy over all rank-one POVMs
is equal to a quantity known as the von Neumann entropy of the density operator $\rho$.
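As an illustration of (10.79) through (10.81), the following sketch computes the outcome distribution and Shannon entropy of a three-outcome qubit POVM. The particular POVM (the "trine" measurement, with real matrix entries) and the input state $|0\rangle\langle 0|$ are assumptions chosen for the example, and the helper names are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def povm_probs(povm, rho):
    """Outcome probabilities p_X(x) = Tr{Lambda_x rho}."""
    return [sum(matmul(el, rho)[i][i] for i in range(len(rho))) for el in povm]

def shannon(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 1e-15)

def trine_povm():
    """Three rank-one elements (2/3)|psi_k><psi_k| with real vectors at angles 0, 60, 120 degrees."""
    elements = []
    for k in range(3):
        t = k * math.pi / 3
        c, s = math.cos(t), math.sin(t)
        elements.append([[2 / 3 * c * c, 2 / 3 * c * s],
                         [2 / 3 * c * s, 2 / 3 * s * s]])
    return elements

rho = [[1.0, 0.0], [0.0, 0.0]]  # the pure state |0><0|
probs = povm_probs(trine_povm(), rho)
print(probs, shannon(probs))
```

Even though the state is pure, this POVM yields a nonzero Shannon entropy, because it is not the measurement that minimizes the entropy of the outcomes.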

10.8.2 Accessible information 

Let us consider the scenario introduced at the beginning of this section, where Alice prepares
an ensemble $\mathcal{E} \equiv \{p_X(x), \rho_x\}$ and Bob performs a POVM $\{\Lambda_y\}$. Suppose now that Bob is
actually trying to retrieve as much information as possible about the random variable $X$.
The quantity that governs how much information he can learn about random variable $X$
if he possesses random variable $Y$ is the mutual information $I(X;Y)$. But here, Bob can
actually choose which measurement he would like to perform, and it would be good for
him to perform the measurement that maximizes his information about $X$. The resulting
quantity is known as the accessible information $I_{\text{acc}}(\mathcal{E})$ of the ensemble $\mathcal{E}$ (because it is the
information that Bob can access about random variable $X$):

$$I_{\text{acc}}(\mathcal{E}) \equiv \max_{\{\Lambda_y\}} I(X;Y), \tag{10.82}$$

where the marginal density $p_X(x)$ is that from the ensemble and the conditional density
$p_{Y|X}(y|x)$ is given in (10.78). In the next chapter, we show how to obtain a natural bound
on this quantity, called the Holevo bound. The bound arises from a quantum generalization
of the data processing inequality.

10.8.3 Classical Mutual Information of a Bipartite State 

A final quantity that we introduce is the classical mutual information $I_c(\rho^{AB})$ of a bipartite
state $\rho^{AB}$. Suppose that Alice and Bob possess some bipartite state $\rho^{AB}$ and would like to
extract maximal classical correlation from it. That is, they each retrieve a random variable
by performing respective local POVMs $\{\Lambda_x^A\}$ and $\{\Lambda_y^B\}$ on their halves of the bipartite state
$\rho^{AB}$. These measurements produce respective random variables $X$ and $Y$, and they would
like $X$ and $Y$ to be as correlated as possible. A good measure of their resulting classical
correlations obtainable from local quantum information processing is as follows:

$$I_c(\rho^{AB}) \equiv \max_{\{\Lambda_x^A\}, \{\Lambda_y^B\}} I(X;Y), \tag{10.83}$$

where the joint distribution is

$$p_{X,Y}(x,y) \equiv \mathrm{Tr}\{(\Lambda_x^A \otimes \Lambda_y^B)\, \rho^{AB}\}. \tag{10.84}$$

Suppose that the state $\rho^{AB}$ is classical, that is, it has the form

$$\rho^{AB} = \sum_{x,y} p_{X,Y}(x,y)\, |x\rangle\langle x|^A \otimes |y\rangle\langle y|^B, \tag{10.85}$$

where the states $|x\rangle$ form an orthonormal basis and so do the states $|y\rangle$. Then, the optimal
measurement in this case is for Alice to perform a von Neumann measurement in the basis
$|x\rangle$ and inform Bob to perform a similar measurement in the basis $|y\rangle$. The amount of
correlation they extract is then equal to $I(X;Y)$.

Exercise 10.8.1 Prove that it suffices to consider maximizing over rank-one POVMs when
computing (10.83). (Hint: Consider refining the POVM $\{\Lambda_x\}$ as the rank-one POVM
$\{|\phi_{x,z}\rangle\langle\phi_{x,z}|\}$, where we spectrally decompose $\Lambda_x$ as $\sum_z |\phi_{x,z}\rangle\langle\phi_{x,z}|$, and then exploit
the data processing inequality.)
10.9 History and Further Reading

The book of Cover and Thomas is an excellent introduction to entropy and information
theory (some of the material in this chapter is similar to material appearing in that book) [57].
MacKay's book is also a good introduction [189]. E. T. Jaynes was an advocate of the
Principle of Maximum Entropy, proclaiming its utility in several sources [161, 162, 163]. A
good exposition of Fano's inequality appears on Scholarpedia [92].
CHAPTER 11



Quantum Information and Entropy 



In this chapter, we discuss several information measures that are important for quantifying
the amount of information and correlations that are present in quantum systems. The first
fundamental measure that we introduce is the von Neumann entropy. It is the quantum
analog of the Shannon entropy, but it captures both classical and quantum uncertainty in a
quantum state.¹ The von Neumann entropy gives meaning to a notion of the information
qubit. This notion is different from that of the physical qubit, which is the description of
a quantum state in an electron or a photon. The information qubit is the fundamental
quantum informational unit of measure, determining how much quantum information is in
a quantum system.

The beginning definitions here are analogous to the classical definitions of entropy, but 
we soon discover a radical departure from the intuitive classical notions from the previous 
chapter: the conditional quantum entropy can be negative for certain quantum states. In 
the classical world, this negativity simply does not occur, though it takes a special meaning 
in quantum information theory. Pure quantum states that are entangled have stronger- 
than-classical spatial correlations and are examples of states that have negative conditional 
entropy. The negative of the conditional quantum entropy is so important in quantum 
information theory that we even have a special name for it: the coherent information. We 
discover that the coherent information obeys a quantum data processing inequality, placing 
it on a firm footing as a particular informational measure of quantum correlations. 

We then define several other quantum information measures, such as quantum mutual
information, that bear similar definitions as in the classical world, but with Shannon entropies
replaced with von Neumann entropies. This replacement may seem to make quantum entropy
somewhat trivial on the surface, but a simple calculation reveals that a maximally entangled
state on two qubits registers two bits of quantum mutual information (recall that the largest
the mutual information can be in the classical world is one bit for the case of two maximally
correlated bits). We then discuss several information inequalities that play an important
role in quantum information processing: the fundamental quantum information inequality,
strong subadditivity, the quantum data processing inequality, and continuity of quantum
entropy.

¹ We should point out the irony in the historical development of classical and quantum entropy. The von
Neumann entropy has seen much widespread use in modern quantum information theory, and perhaps this
would make one think that von Neumann discovered this quantity much after Shannon. But in fact, the
reverse is true. Von Neumann first discovered what is now known as the von Neumann entropy and applied
it to questions in statistical physics. Much later, Shannon determined an information-theoretic formula and
asked von Neumann what he should call it. Von Neumann told him to call it the entropy for two reasons:
1) it was a special case of the von Neumann entropy and 2) he would always have the advantage in a debate
because von Neumann claimed that no one at the time really understood entropy.



11.1 Quantum Entropy 



We might expect a measure of the entropy of a quantum system to be vastly different from the 
classical measure of entropy from the previous chapter because a quantum system possesses 
not only classical uncertainty but also quantum uncertainty that arises from the uncertainty 
principle. But recall that the density operator captures both types of uncertainty and allows 
us to determine probabilities for the outcomes of any measurement on system A. Thus, a 
quantum measure of uncertainty should be a direct function of the density operator, just as 
the classical measure of uncertainty is a direct function of a probability density function. It 
turns out that this function has a strikingly similar form to the classical entropy, as we see 
below. 

Definition 11.1.1 (Quantum Entropy). Suppose that Alice prepares some quantum system
$A$ in a state $\rho^A$. Then the entropy $H(A)$ of the state is as follows:

$$H(A) \equiv -\mathrm{Tr}\{\rho^A \log \rho^A\}. \tag{11.1}$$

The entropy of a quantum system is also known as the von Neumann entropy or the
quantum entropy, but we often simply refer to it as the entropy. We can denote it by $H(A)_\rho$
or $H(\rho)$ to show the explicit dependence on the density operator $\rho^A$. The von Neumann
entropy has a special relation to the eigenvalues of the density operator, as the following
exercise asks you to verify.

Exercise 11.1.1 Consider a density operator $\rho$ with the following spectral decomposition:

$$\rho = \sum_x p_X(x)\, |x\rangle\langle x|. \tag{11.2}$$

Show that the entropy $H(\rho)$ is the same as the entropy $H(X)$ of a random variable $X$ with
probability distribution $p_X(x)$.

In our definition of quantum entropy, we use the same notation H as in the classical case 
to denote the entropy of a quantum system. It should be clear from context whether we are 
referring to the entropy of a quantum or classical system. 

The quantum entropy admits an intuitive interpretation. Suppose that Alice generates
a random quantum state $|\psi_y\rangle$ in her lab according to some probability density $p_Y(y)$ of a
random variable $Y$. Suppose further that Bob has not yet received the state from Alice and
does not know which one she sent. The expected density operator from Bob's point of view
is then

$$\sigma = \mathbb{E}_Y\{|\psi_Y\rangle\langle\psi_Y|\} = \sum_y p_Y(y)\, |\psi_y\rangle\langle\psi_y|. \tag{11.3}$$

The interpretation of the entropy $H(\sigma)$ is that it quantifies Bob's uncertainty about the
state Alice sent: his expected information gain is $H(\sigma)$ qubits upon receiving and measuring
the state that Alice sends. Schumacher's noiseless quantum coding theorem, described in
Chapter 17, gives an alternative operational interpretation of the von Neumann entropy by
proving that Alice needs to send Bob qubits at a rate $H(\sigma)$ in order for him to be able to
decode a compressed quantum state.

The above interpretation of quantum entropy seems qualitatively similar to the interpretation
of classical entropy. There is, however, a significant quantitative difference that
illuminates the difference between Shannon entropy and von Neumann entropy. We consider
an example. Suppose that Alice generates a sequence $|\psi_1\rangle |\psi_2\rangle \cdots |\psi_n\rangle$ of quantum states
according to the following "BB84" ensemble:

$$\{\{1/4, |0\rangle\}, \{1/4, |1\rangle\}, \{1/4, |+\rangle\}, \{1/4, |-\rangle\}\}. \tag{11.4}$$

Suppose that she and Bob share a noiseless classical channel. If she employs Shannon's
classical noiseless coding protocol, she should transmit classical data to Bob at a rate of two
classical channel uses per source state $|\psi_i\rangle$ in order for him to reliably recover the classical
data needed to reproduce the sequence of states that Alice transmitted (the Shannon entropy
of the uniform distribution over four symbols is 2 bits).

Now let us consider computing the von Neumann entropy of the above ensemble. First,
we determine the expected density operator of Alice's ensemble:

$$\frac{1}{4}\left(|0\rangle\langle 0| + |1\rangle\langle 1| + |+\rangle\langle +| + |-\rangle\langle -|\right) = \pi, \tag{11.5}$$

where $\pi$ is the maximally mixed state. The von Neumann entropy of the above density
operator is one qubit because the eigenvalues of $\pi$ are both equal to 1/2. Suppose now
that Alice and Bob share a noiseless quantum channel between them, that is, a channel that
can preserve quantum coherence without any interaction with an environment. Then Alice
only needs to send qubits at a rate of one channel use per source symbol if she employs a
protocol known as Schumacher compression (we discuss this protocol in detail in Chapter 17).
Bob can then reliably decode the qubits that Alice sent. The protocol also causes only
an asymptotically vanishing disturbance to the state. The above departure from classical
information theory holds in general: Exercise 11.9.2 of this chapter asks you to prove that
the Shannon entropy of any ensemble is never less than the von Neumann entropy of its
expected density operator.
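The BB84 comparison above can be reproduced with a few lines of arithmetic. The sketch below is restricted to qubits and real amplitudes, and it obtains the eigenvalues of a $2 \times 2$ density matrix from its determinant (valid because the trace is one); the function names `outer` and `qubit_entropy` are illustrative assumptions.

```python
import math

def outer(v):
    """Projector |v><v| for a real two-component vector v."""
    return [[a * b for b in v] for a in v]

def qubit_entropy(rho):
    """von Neumann entropy in bits of a 2x2 real density matrix with unit trace."""
    det = rho[0][0] * rho[1][1] - rho[0][1] * rho[1][0]
    # Eigenvalues of a unit-trace 2x2 Hermitian matrix are (1 +/- sqrt(1 - 4 det)) / 2.
    disc = math.sqrt(max(0.0, 1.0 - 4.0 * det))
    eigs = [(1 + disc) / 2, (1 - disc) / 2]
    return -sum(lam * math.log2(lam) for lam in eigs if lam > 1e-15)

s = 1 / math.sqrt(2)
states = [[1.0, 0.0], [0.0, 1.0], [s, s], [s, -s]]  # |0>, |1>, |+>, |->
probs = [0.25] * 4

# Expected density operator of the ensemble, as in (11.5).
rho = [[sum(p * outer(v)[i][j] for p, v in zip(probs, states)) for j in range(2)]
       for i in range(2)]

shannon_bits = -sum(p * math.log2(p) for p in probs)
print(shannon_bits, qubit_entropy(rho))  # two bits versus about one qubit
```

The gap between the two numbers reflects the fact that the four states are not mutually orthogonal, so they cannot all be perfectly distinguished.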

11.1.1 Mathematical Properties of Quantum Entropy 

We now discuss several mathematical properties of the quantum entropy: positivity, its
minimum value, its maximum value, its invariance under unitaries, and concavity. The first
three of these properties follow from the analogous properties in the classical world because
the von Neumann entropy of a density operator is the Shannon entropy of its eigenvalues
(see Exercise 11.1.1). We state them formally below:



Property 11.1.1 (Positivity) The von Neumann entropy $H(\rho)$ is non-negative for any
density operator $\rho$:

$$H(\rho) \geq 0. \tag{11.6}$$

Proof. Positivity of quantum entropy follows from positivity of Shannon entropy. □ 

Property 11.1.2 (Minimum Value) The minimum value of the von Neumann entropy is 
zero, and it occurs when the density operator is a pure state. 

Proof. The minimum value equivalently occurs when the eigenvalues of a density operator 
are distributed with all the mass on one value and zero on the others, so that the density 
operator is rank one and corresponds to a pure state. □ 

Why should the entropy of a pure quantum state vanish? It seems that there is quantum
uncertainty inherent in the state itself and that a measure of quantum uncertainty should
capture this fact. This last observation only makes sense if we do not know anything about
the state that is prepared. But if we know exactly how it was prepared, we can perform a
special quantum measurement to verify that the quantum state was prepared, and we do
not learn anything from this measurement because the outcome of it is always certain. For
example, suppose that Alice always prepares the state $|\phi\rangle$ and Bob knows that she does so.
He can then perform a measurement of the following form $\{|\phi\rangle\langle\phi|, I - |\phi\rangle\langle\phi|\}$ to verify that
she prepared this state. He always receives the first outcome from the measurement and
never gains any information from it. Thus, it makes sense to say that the entropy of a pure
state vanishes.

Property 11.1.3 (Maximum Value) The maximum value of the von Neumann entropy is 
log D where D is the dimension of the system, and it occurs for the maximally mixed state. 

Proof. The proof of the above property is the same as in the classical case. □ 

Property 11.1.4 (Concavity) The entropy is concave in the density operator:

$$H(\rho) \geq \sum_x p_X(x)\, H(\rho_x), \tag{11.7}$$

where $\rho \equiv \sum_x p_X(x)\, \rho_x$.

The physical interpretation of concavity is as before for classical entropy: entropy can
never decrease under a mixing operation. This inequality is a fundamental property of the
entropy, and we prove it after developing some important entropic tools (see Exercise 11.6.9).



Property 11.1.5 (Unitary Invariance) The entropy of a density operator is invariant
under unitary operations on it:

$$H(\rho) = H(U \rho U^\dagger). \tag{11.8}$$

Proof. Unitary invariance of entropy follows by observing that the eigenvalues of a density
operator are invariant under a unitary:

\begin{align}
U \rho U^\dagger &= U \sum_x p_X(x)\, |x\rangle\langle x|\, U^\dagger \tag{11.9} \\
&= \sum_x p_X(x)\, |\phi_x\rangle\langle\phi_x|, \tag{11.10}
\end{align}

where $\{|\phi_x\rangle\}$ is some orthonormal basis such that $U|x\rangle = |\phi_x\rangle$. The above property follows
because the entropy is a function of the eigenvalues of a density operator. □

A unitary operator is the quantum analog of a permutation in this context (consider
Property 10.1.3 of the classical entropy).



Exercise 11.1.2 The purity of a density operator $\rho^A$ is $\mathrm{Tr}\{(\rho^A)^2\}$. Suppose
$\rho^A = \mathrm{Tr}_B\{|\Phi^+\rangle\langle\Phi^+|^{AB}\}$.
Prove that the purity is equal to the inverse of the dimension $d$ in this case.

11.1.2 Alternate Characterization of the von Neumann Entropy 

There is an interesting alternate characterization of the von Neumann entropy of a state $\rho$
as the minimum Shannon entropy of a rank-one POVM performed on it (we discussed this
briefly in Section 10.8.1). That is, we would like to show that

$$H(\rho) = \min_{\{\Lambda_y\}} -\sum_y \mathrm{Tr}\{\Lambda_y \rho\} \log_2(\mathrm{Tr}\{\Lambda_y \rho\}), \tag{11.11}$$

where the minimum is restricted to be over rank-one POVMs (those with $\Lambda_y = |\phi_y\rangle\langle\phi_y|$ for
some vectors $|\phi_y\rangle$ such that $\mathrm{Tr}\{|\phi_y\rangle\langle\phi_y|\} \leq 1$ and $\sum_y |\phi_y\rangle\langle\phi_y| = I$). In this sense, there
is some optimal measurement to perform on $\rho$ such that its entropy is equivalent to the
von Neumann entropy, and this optimal measurement is the "right question to ask" (as we
discussed early on in Section 1.2.2).



In order to prove the above result, we should first realize that a von Neumann measurement
in the eigenbasis of $\rho$ should achieve the minimum. That is, if $\rho = \sum_x p_X(x)\, |x\rangle\langle x|$,
we should expect that the measurement $\{|x\rangle\langle x|\}$ achieves the minimum. In this case, the
Shannon entropy of the measurement is equal to the Shannon entropy of $p_X(x)$, as discussed
in Exercise 11.1.1. We now prove that any other rank-one POVM has an entropy no smaller
than that given by this measurement. Consider that the distribution of the measurement
outcomes for $\{|\phi_y\rangle\langle\phi_y|\}$ is equal to

$$\mathrm{Tr}\{|\phi_y\rangle\langle\phi_y|\, \rho\} = \sum_x |\langle\phi_y|x\rangle|^2\, p_X(x), \tag{11.12}$$

so that we can think of |(0j£)| as a conditional probability distribution. Introducing 
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f(p) = —p\og 2 p, which is a concave function, we can write the von Neumann entropy as 

H(p) = J2f(Px(x)) (11.13) 

X 

= J2f(Px(x)) + f(px(x )), (11.14) 

x 

where x is a symbol added to the alphabet of x such that Pxi^o) — 0. Let us denote 
the enlarged alphabet with the symbols x' so that H(p) = ^2 x i f(px(x'))- We know that 
2J( < / , i/l a ')l = 1 from the fact that the set {\4>y){4>y\} forms a POVM and \x) is a normalized 
state. We also know that X^K^I 3 -)! — 1 because Tr{|<^j / }(<^ y |} < 1 for a rank-one POVM. 
Thinking of |(0 y |x)| as a distribution over x, we can add a symbol xq with probability 
1 — {4> y \(f)y) so that it makes a normalized distribution. Let us call this distribution p(x'\y). 
We then have that 

H(p) = Y,f(Px(x)) (11.15) 

X 

= £lto,l*)lV(p*o«0) ( n - 16 ) 

= E^^)/(^^')) (H-17) 

= E(E^^)/(^(^))j (11-18) 

y \ x' / 

<J2f[J2p( x> \y^( x> y\ ( n - 19 ) 

y \ x' / 

= £/(Tr{|0,)<0>}). (11.20) 

The third equality follows from our assumption that px(xo) for the added symbol Xq. The 
only inequality follows from concavity of /. The last expression is equivalent to the Shannon 
entropy of the POVM {\4> y )(4> y \} when performed on the state p. 
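
The characterization above can be checked numerically. The following is a minimal sketch,
assuming NumPy; the helper names (`shannon`, `von_neumann`, `measurement_entropy`) and the
random test state are illustrative choices, not part of the text. It compares the outcome
entropy of the eigenbasis measurement with that of randomly drawn orthonormal bases, which
are only a subset of all rank-one POVMs.

```python
import numpy as np

def shannon(p):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 1e-15]
    return float(-np.sum(p * np.log2(p)))

def von_neumann(rho):
    """von Neumann entropy in bits, computed from the eigenvalues of rho."""
    return shannon(np.linalg.eigvalsh(rho))

def measurement_entropy(rho, basis):
    """Shannon entropy of the outcomes <phi_y|rho|phi_y>; columns of basis are |phi_y>."""
    probs = np.real(np.einsum("iy,ij,jy->y", basis.conj(), rho, basis))
    return shannon(probs)

rng = np.random.default_rng(0)
d = 4

# Random full-rank density operator (an arbitrary test state).
g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = g @ g.conj().T
rho /= np.trace(rho).real

# Measurement in the eigenbasis of rho.
_, eigvecs = np.linalg.eigh(rho)
h_rho = von_neumann(rho)
h_eig = measurement_entropy(rho, eigvecs)

# Measurements in random orthonormal bases (obtained from a QR decomposition).
h_random = []
for _ in range(200):
    m = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, _ = np.linalg.qr(m)
    h_random.append(measurement_entropy(rho, q))

print(h_rho, h_eig, min(h_random))
```

The eigenbasis entropy should agree with $H(\rho)$ up to rounding, and no sampled basis should
fall below it.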

11.2 Joint Quantum Entropy 

The joint quantum entropy $H(AB)_\rho$ of the density operator $\rho^{AB}$ for a bipartite
system $AB$ follows naturally from the definition of quantum entropy:

$$H(AB)_\rho \equiv -\mathrm{Tr}\{\rho^{AB} \log \rho^{AB}\}. \tag{11.21}$$

We introduce a few of its properties in the subsections below.
11.2.1 Marginal Entropies of a Pure Bipartite State

The five properties of quantum entropy in the previous section may give you the impression 
that the nature of quantum information is not too different from that of classical information. 
We proved all these properties for the classical case, and their proofs for the quantum case 
seem similar. The first three even resort to the proofs in the classical case! 



Theorem 11.2.1 below is where we observe our first radical departure from the classical
world. It states that the marginal entropies of a pure bipartite state are equal, while the
entropy of the overall state remains zero. Recall that the joint entropy $H(X,Y)$ of two
random variables $X$ and $Y$ is never less than one of the marginal entropies $H(X)$ or $H(Y)$:

$$H(X,Y) \geq H(X), \tag{11.22}$$

$$H(X,Y) \geq H(Y). \tag{11.23}$$

The above inequalities follow from the positivity of classical conditional entropy. But in the
quantum world, these inequalities do not always have to hold, and the following theorem
demonstrates that they do not hold for an arbitrary pure bipartite quantum state with
Schmidt rank greater than one (see Theorem 3.6.1 for a definition of Schmidt rank). The
fact that the joint quantum entropy can be less than the marginal quantum entropy contains
in it one of the most fundamental differences between classical and quantum information.

Theorem 11.2.1. The marginal entropies $H(A)_\phi$ and $H(B)_\phi$ of a pure bipartite state
$|\phi\rangle^{AB}$ are equal:

$$H(A)_\phi = H(B)_\phi, \tag{11.24}$$

while the joint entropy $H(AB)_\phi$ vanishes:

$$H(AB)_\phi = 0. \tag{11.25}$$

Proof. The crucial ingredient for the proof of this theorem is the Schmidt decomposition
(Theorem 3.6.1). Recall that any bipartite state $|\phi\rangle^{AB}$ admits a Schmidt
decomposition of the following form:

$$|\phi\rangle^{AB} = \sum_i \sqrt{\lambda_i} \, |i\rangle^A |i\rangle^B, \tag{11.26}$$

where $\{|i\rangle^A\}$ is some orthonormal set of vectors on system $A$ and $\{|i\rangle^B\}$ is
some orthonormal set on system $B$. Recall that the Schmidt rank is equal to the number of
non-zero coefficients $\lambda_i$. Then the respective marginal states $\rho^A$ and $\rho^B$ on
systems $A$ and $B$ are as follows:

$$\rho^A = \sum_i \lambda_i \, |i\rangle\langle i|^A, \tag{11.27}$$

$$\rho^B = \sum_i \lambda_i \, |i\rangle\langle i|^B. \tag{11.28}$$

Thus, the marginal states admit a spectral decomposition with the same eigenvalues. The
theorem follows because the von Neumann entropy depends only on the eigenvalues of a
given spectral decomposition. □
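
As a numerical illustration of Theorem 11.2.1, the sketch below (assuming NumPy; the
dimensions and variable names are arbitrary choices for illustration) draws a random pure
state on systems of unequal dimension, forms both marginals, and compares their entropies
with the entropy of the squared Schmidt coefficients obtained from a singular value
decomposition.

```python
import numpy as np

def von_neumann(rho):
    """von Neumann entropy in bits, computed from the eigenvalues of rho."""
    ev = np.linalg.eigvalsh(rho)
    ev = ev[ev > 1e-15]
    return float(-np.sum(ev * np.log2(ev)))

rng = np.random.default_rng(1)
d_a, d_b = 3, 5

# psi[i, j] is the amplitude of |i>^A |j>^B in a random pure state.
psi = rng.normal(size=(d_a, d_b)) + 1j * rng.normal(size=(d_a, d_b))
psi /= np.linalg.norm(psi)

rho_a = psi @ psi.conj().T                            # partial trace over B
rho_b = psi.T @ psi.conj()                            # partial trace over A
rho_ab = np.outer(psi.ravel(), psi.ravel().conj())    # the pure joint state

# Squared singular values of the amplitude matrix are the Schmidt coefficients lambda_i.
schmidt = np.linalg.svd(psi, compute_uv=False) ** 2
h_schmidt = float(-np.sum(schmidt * np.log2(schmidt)))

h_a, h_b, h_ab = von_neumann(rho_a), von_neumann(rho_b), von_neumann(rho_ab)
print(h_a, h_b, h_ab, h_schmidt)
```

Both marginal entropies should coincide with the Schmidt-coefficient entropy, while the joint
entropy should vanish up to rounding.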
The theorem applies not only to two systems $A$ and $B$, but it also applies to any number
of systems if we make a bipartite cut of the systems. For example, if the state is
$|\phi\rangle^{ABCDE}$, then the following equalities (and others from different combinations)
hold by applying Theorem 11.2.1 and Remark 3.6.1:

$$H(A)_\phi = H(BCDE)_\phi, \tag{11.29}$$

$$H(AB)_\phi = H(CDE)_\phi, \tag{11.30}$$

$$H(ABC)_\phi = H(DE)_\phi, \tag{11.31}$$

$$H(ABCD)_\phi = H(E)_\phi. \tag{11.32}$$

The closest analogy in the classical world to the above property is when we copy a random
variable $X$. That is, suppose that $X$ has a distribution $p_X(x)$ and $\hat{X}$ is some copy of
it so that the distribution of the joint random variable $X\hat{X}$ is $p_X(x)\delta_{x,\hat{x}}$.
Then the marginal entropies $H(X)$ and $H(\hat{X})$ are equal. But observe that the joint
entropy $H(X\hat{X})$ is also equal to $H(X)$, and this is where the analogy breaks down.

11.2.2 Additivity 

The quantum entropy is additive for tensor product states:

$$H(\rho \otimes \sigma) = H(\rho) + H(\sigma). \tag{11.33}$$

One can verify this property simply by diagonalizing both density operators and resorting
to the additivity of the joint Shannon entropies of the eigenvalues.

Additivity is a property that we would like to hold for any measure of information.
For example, suppose that Alice generates a large sequence
$|\psi_{x_1}\rangle |\psi_{x_2}\rangle \cdots |\psi_{x_n}\rangle$ of quantum states according to the
ensemble $\{p_X(x), |\psi_x\rangle\}$. She may be aware of the classical indices
$x_1 x_2 \cdots x_n$, but a third party to whom she sends the quantum sequence may not be aware
of these values. The description of the state to this third party is then as follows:

$$\rho \otimes \cdots \otimes \rho, \tag{11.34}$$

where $\rho \equiv \mathbb{E}_X\{|\psi_X\rangle\langle\psi_X|\}$, and the quantum entropy of this
$n$-fold tensor product state is

$$H(\rho \otimes \cdots \otimes \rho) = n H(\rho), \tag{11.35}$$

by applying (11.33) inductively.
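
Additivity is easy to confirm numerically. The sketch below assumes NumPy and uses a
hypothetical helper `random_density` to generate test states; it checks (11.33) for one
pair of states and (11.35) for a four-fold tensor power.

```python
import numpy as np

def von_neumann(rho):
    """von Neumann entropy in bits, computed from the eigenvalues of rho."""
    ev = np.linalg.eigvalsh(rho)
    ev = ev[ev > 1e-15]
    return float(-np.sum(ev * np.log2(ev)))

def random_density(d, rng):
    """Hypothetical helper: a random full-rank density operator of dimension d."""
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

rng = np.random.default_rng(2)
rho = random_density(2, rng)
sigma = random_density(3, rng)

# H(rho (x) sigma) versus H(rho) + H(sigma).
h_product = von_neumann(np.kron(rho, sigma))
h_sum = von_neumann(rho) + von_neumann(sigma)

# n-fold tensor power of rho, built by repeated Kronecker products.
n = 4
rho_n = rho
for _ in range(n - 1):
    rho_n = np.kron(rho_n, rho)
h_n = von_neumann(rho_n)

print(h_product, h_sum, h_n, n * von_neumann(rho))
```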



11.2.3 Joint Quantum Entropy of a Classical-Quantum State 

Recall that a classical-quantum state is a bipartite state in which a classical system and a 
quantum system are classically correlated. An example of such a state is as follows: 

$$\rho^{XB} = \sum_x p_X(x) \, |x\rangle\langle x|^X \otimes \rho_x^B. \tag{11.36}$$

The joint quantum entropy of this state takes on a special form that appears similar to 
entropies in the classical world. 
Theorem 11.2.2. The joint entropy $H(XB)_\rho$ of a classical-quantum state is as follows:

$$H(XB)_\rho = H(X) + \sum_x p_X(x) H(\rho_x), \tag{11.37}$$

where $H(X)$ is the entropy of a random variable with distribution $p_X(x)$.

Proof. First, suppose that the conditional density operators $\rho_x^B$ have the following
spectral decomposition:

$$\rho_x^B = \sum_y p_{Y|X}(y|x) \, |y_x\rangle\langle y_x|^B, \tag{11.38}$$

where we write the eigenstates $|y_x\rangle$ with a subscript $x$ to indicate that the basis
$\{|y_x\rangle\}$ may be different for each value of $x$. Then

$$\begin{aligned}
H(XB)_\rho &= -\mathrm{Tr}\{\rho^{XB} \log \rho^{XB}\} && (11.39) \\
&= -\mathrm{Tr}\Big\{ \Big[ \sum_{x,y} p_X(x) p_{Y|X}(y|x) \, |x\rangle\langle x|^X \otimes |y_x\rangle\langle y_x|^B \Big] \times \\
&\qquad\quad \log\Big[ \sum_{x',y'} p_X(x') p_{Y|X}(y'|x') \, |x'\rangle\langle x'|^X \otimes |y'_{x'}\rangle\langle y'_{x'}|^B \Big] \Big\} && (11.40) \\
&= -\sum_{x,y,x',y'} p_X(x) p_{Y|X}(y|x) \log\big( p_X(x') p_{Y|X}(y'|x') \big) \times \\
&\qquad\quad \mathrm{Tr}\big\{ |x\rangle\langle x|x'\rangle\langle x'|^X \otimes |y_x\rangle\langle y_x|y'_{x'}\rangle\langle y'_{x'}|^B \big\}. && (11.41)
\end{aligned}$$

The first equality follows by definition. The second equality follows by expanding $\rho^{XB}$
with (11.36) and (11.38). The third equality follows because
$f(A) = f(\sum_i a_i |i\rangle\langle i|) = \sum_i f(a_i) |i\rangle\langle i|$, where
$\sum_i a_i |i\rangle\langle i|$ is a spectral decomposition of $A$. Continuing,

$$\begin{aligned}
&= -\sum_{x,y,y'} p_X(x) p_{Y|X}(y|x) \log\big( p_X(x) p_{Y|X}(y'|x) \big) \, \mathrm{Tr}\big\{ |y_x\rangle\langle y_x|y'_x\rangle\langle y'_x|^B \big\} && (11.42) \\
&= -\sum_{x,y} p_X(x) p_{Y|X}(y|x) \log\big( p_X(x) p_{Y|X}(y|x) \big) \, \mathrm{Tr}\big\{ |y_x\rangle\langle y_x|^B \big\} && (11.43) \\
&= -\sum_{x,y} p_X(x) p_{Y|X}(y|x) \log\big( p_X(x) p_{Y|X}(y|x) \big) && (11.44) \\
&= -\sum_x p_X(x) \log(p_X(x)) - \sum_x p_X(x) \sum_y p_{Y|X}(y|x) \log(p_{Y|X}(y|x)) && (11.45) \\
&= H(X) + \sum_x p_X(x) H(\rho_x). && (11.46)
\end{aligned}$$

The first equality follows from linearity of trace and partially evaluating it. The second
equality follows because the eigenstates $\{|y_x\rangle\}$ form an orthonormal basis for the
same $x$. The third equality follows because $\mathrm{Tr}\{|y_x\rangle\langle y_x|^B\} = 1$. The
fourth equality follows because the logarithm of the products is the sum of the logarithms.
The final equality follows from the definition of entropy and because
$-\sum_y p_{Y|X}(y|x) \log(p_{Y|X}(y|x))$ is the quantum entropy of the density operator
$\rho_x$. □

As we stated earlier, the joint quantum entropy of a classical-quantum state takes on a
special form that is analogous to the classical joint entropy. Inspection of the third to last
line above reveals this similarity because it looks exactly the same as the formula in (10.23).
We explore this connection in further detail in Section 11.4.1.
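
The formula in Theorem 11.2.2 can be illustrated with a small numerical example. The sketch
below assumes NumPy; the distribution `p_x` and the randomly generated conditional states
are arbitrary illustrative choices. It builds the block-diagonal classical-quantum state of
(11.36) and compares both sides of (11.37).

```python
import numpy as np

def shannon(p):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 1e-15]
    return float(-np.sum(p * np.log2(p)))

def von_neumann(rho):
    """von Neumann entropy in bits, computed from the eigenvalues of rho."""
    return shannon(np.linalg.eigvalsh(rho))

def random_density(d, rng):
    """Hypothetical helper: a random full-rank density operator of dimension d."""
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

rng = np.random.default_rng(3)
p_x = np.array([0.2, 0.5, 0.3])
d_b = 2
states = [random_density(d_b, rng) for _ in p_x]

# rho^{XB} = sum_x p_X(x) |x><x|^X (x) rho_x^B
d_x = len(p_x)
rho_xb = np.zeros((d_x * d_b, d_x * d_b), dtype=complex)
for x, (px, rho_x) in enumerate(zip(p_x, states)):
    proj = np.zeros((d_x, d_x))
    proj[x, x] = 1.0
    rho_xb += px * np.kron(proj, rho_x)

lhs = von_neumann(rho_xb)                                   # H(XB)
rhs = shannon(p_x) + sum(px * von_neumann(r) for px, r in zip(p_x, states))
print(lhs, rhs)
```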



11.3 Potential yet Unsatisfactory Definitions of Conditional Quantum Entropy

The conditional quantum entropy may perhaps seem a bit difficult to define at first because
there is no formal notion of conditional probability in the quantum theory. There are,
however, two senses that are perhaps closest to the notion of conditional probability, but
neither of them leads to a satisfactory definition of conditional quantum entropy.
Nevertheless, it is instructive for us to explore both of these notions for a bit. The first
arises in the noisy quantum theory, and the second arises in the purified quantum theory.

We develop the first notion. Consider an arbitrary bipartite state $\rho^{AB}$. Suppose that
Alice performs a complete von Neumann measurement $\Pi \equiv \{|x\rangle\langle x|\}$ of her
system in the basis $\{|x\rangle\}$. This procedure leads to an ensemble
$\{p_X(x), |x\rangle\langle x| \otimes \rho_x\}$, where

$$\rho_x \equiv \frac{1}{p_X(x)} \mathrm{Tr}_A\big\{ (|x\rangle\langle x|^A \otimes I^B) \, \rho^{AB} \, (|x\rangle\langle x|^A \otimes I^B) \big\}, \tag{11.47}$$

$$p_X(x) \equiv \mathrm{Tr}\big\{ (|x\rangle\langle x|^A \otimes I^B) \, \rho^{AB} \big\}. \tag{11.48}$$

One could then think of the density operators $\rho_x$ as being conditional on the outcome of
the measurement, and these density operators describe the state of Bob given knowledge of the
outcome of the measurement.

We could potentially define a conditional entropy as follows:

$$H(B|A)_\Pi \equiv \sum_x p_X(x) H(\rho_x), \tag{11.49}$$

in analogy with the definition of the classical entropy in (10.18). This approach might seem to
lead to a useful definition of conditional quantum entropy, but the problem with it is that the
entropy depends on the measurement chosen (the notation $H(B|A)_\Pi$ explicitly indicates this
dependence). This problem does not occur in the classical world because the probabilities for
the outcomes of measurements do not themselves depend on the measurement selected, unless
we apply some coarse graining to the outcomes. This dependence on measurement, however,
is a fundamental aspect of the quantum theory.
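
A small example makes the measurement dependence of (11.49) concrete. The sketch below
assumes NumPy, and the choice of state is ours rather than the text's: a classically
correlated two-qubit state, measured on Alice's side either in the computational basis or
in the Hadamard basis. The function name `measured_conditional_entropy` is hypothetical.

```python
import numpy as np

def von_neumann(rho):
    """von Neumann entropy in bits, computed from the eigenvalues of rho."""
    ev = np.linalg.eigvalsh(rho)
    ev = ev[ev > 1e-15]
    return float(-np.sum(ev * np.log2(ev)))

def measured_conditional_entropy(rho_ab, basis, d_a, d_b):
    """sum_x p_X(x) H(rho_x) after measuring A in the orthonormal columns of `basis`."""
    t = rho_ab.reshape(d_a, d_b, d_a, d_b)
    total = 0.0
    for x in range(d_a):
        ket = basis[:, x]
        # (<x| (x) I) rho (|x> (x) I): the unnormalized post-measurement state on B.
        m = np.einsum("a,abcd,c->bd", ket.conj(), t, ket)
        p = np.trace(m).real
        if p > 1e-12:
            total += p * von_neumann(m / p)
    return total

# Classically correlated state (|00><00| + |11><11|) / 2.
rho_ab = np.zeros((4, 4))
rho_ab[0, 0] = rho_ab[3, 3] = 0.5

z_basis = np.eye(2)
x_basis = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)

h_z = measured_conditional_entropy(rho_ab, z_basis, 2, 2)
h_x = measured_conditional_entropy(rho_ab, x_basis, 2, 2)
print(h_z, h_x)
```

For this state, the computational-basis measurement leaves Bob with a pure state for each
outcome, whereas the Hadamard-basis measurement leaves him with the maximally mixed state,
so the two values of $H(B|A)_\Pi$ differ.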

We could then attempt to remove the dependence of the above definition on a particular
measurement $\Pi$ by defining the conditional quantum entropy to be the minimization of
$H(B|A)_\Pi$ over all possible measurements. The intuition here is perhaps that entropy should
be the minimal amount of conditional uncertainty in a system after employing the best
possible measurement on the other. But the removal of one problem leads to another!
This optimized conditional entropy is now difficult to compute as the system grows larger,
whereas in the classical world, the computation of conditional entropy is simple if one knows
the conditional probabilities. The above idea is useful, but we leave it for now because
there is a simpler definition of conditional quantum entropy that plays a fundamental role
in quantum information theory.

The second notion of conditional probability is actually similar to the above notion,
though we present it in the purified viewpoint. Consider a tripartite state
$|\psi\rangle^{ABC}$ and a bipartite cut $A|BC$ of the systems $A$, $B$, and $C$. Theorem 3.6.1
states that every bipartite state admits a Schmidt decomposition, and the state
$|\psi\rangle^{ABC}$ is no exception. Thus, we can write a Schmidt decomposition for it as
follows:

$$|\psi\rangle^{ABC} = \sum_x \sqrt{p_X(x)} \, |x\rangle^A |\phi_x\rangle^{BC}, \tag{11.50}$$

where $p_X(x)$ is some probability distribution, $\{|x\rangle^A\}$ is an orthonormal basis for
the system $A$, and $\{|\phi_x\rangle^{BC}\}$ is an orthonormal basis for the systems $BC$. Each
state $|\phi_x\rangle^{BC}$ is a pure bipartite state, so we can again apply a Schmidt
decomposition to each of these states:

$$|\phi_x\rangle^{BC} = \sum_y \sqrt{p_{Y|X}(y|x)} \, |y_x\rangle^B |y_x\rangle^C, \tag{11.51}$$

where $p_{Y|X}(y|x)$ is some conditional probability distribution depending on the value of $x$,
and $\{|y_x\rangle^B\}$ and $\{|y_x\rangle^C\}$ are both orthonormal bases with dependence on the
value $x$. Thus, the overall state has the following form:

$$|\psi\rangle^{ABC} = \sum_{x,y} \sqrt{p_{Y|X}(y|x) \, p_X(x)} \, |x\rangle^A |y_x\rangle^B |y_x\rangle^C. \tag{11.52}$$

Suppose that Alice performs a von Neumann measurement in the basis
$\{|x\rangle\langle x|^A\}$. The state on Bob and Charlie's systems is then
$|\phi_x\rangle^{BC}$, and each system on $B$ or $C$ has a marginal entropy of $H(\sigma_x)$ where
$\sigma_x \equiv \sum_y p_{Y|X}(y|x) \, |y_x\rangle\langle y_x|$. We could potentially define
the conditional quantum entropy as

$$\sum_x p_X(x) H(\sigma_x). \tag{11.53}$$

This quantity does not depend on a measurement as before because we simply choose the
measurement from the Schmidt decomposition. But there are many problems with the above
notion of conditional quantum entropy: it is defined only for pure quantum states, it is not
clear how to apply it to a bipartite quantum system, and the conditional entropy of Bob's
system given Alice's and that of Charlie's given Alice's is the same (which is perhaps the
strangest of all!). Thus this notion of conditional probability is not useful for a definition of
conditional quantum entropy.
11.4 Conditional Quantum Entropy

The definition of conditional quantum entropy that has been most useful in quantum
information theory is the following simple one, inspired by the relation between joint
entropy and marginal entropy in Exercise 10.3.1.

Definition 11.4.1 (Conditional Quantum Entropy). The conditional quantum entropy $H(A|B)_\rho$
of a bipartite quantum state $\rho^{AB}$ is the difference of the joint quantum entropy
$H(AB)_\rho$ and the marginal $H(B)_\rho$:

$$H(A|B)_\rho \equiv H(AB)_\rho - H(B)_\rho. \tag{11.54}$$

The above definition is the most natural one, both because it is straightforward to com- 
pute for any bipartite state and because it obeys many relations that the classical conditional 
entropy obeys (such as chaining rules and conditioning reduces entropy). We explore many 
of these relations in the forthcoming sections. For now, we state "conditioning cannot in- 
crease entropy" as the following theorem and tackle its proof later on after developing a few 
more tools. 

Theorem 11.4.1 (Conditioning does not increase entropy). Consider a bipartite quantum
state $\rho^{AB}$. Then the following inequality applies to the marginal entropy $H(A)_\rho$ and
the conditional quantum entropy $H(A|B)_\rho$:

$$H(A)_\rho \geq H(A|B)_\rho. \tag{11.55}$$

We can interpret the above inequality as stating that conditioning cannot increase entropy, 
even if the conditioning system is quantum. 

11.4.1 Conditional Quantum Entropy for Classical-Quantum States 

A classical-quantum state is an example of a state where conditional quantum entropy
behaves as in the classical world. Suppose that two parties share a classical-quantum state
$\rho^{XB}$ of the form in (11.36). The system $X$ is classical and the system $B$ is quantum, and
the correlations between these systems are entirely classical, determined by the probability
distribution $p_X(x)$. Let us calculate the conditional quantum entropy $H(B|X)_\rho$ for this
state:

$$\begin{aligned}
H(B|X)_\rho &= H(XB)_\rho - H(X)_\rho && (11.56) \\
&= H(X) + \sum_x p_X(x) H(\rho_x) - H(X) && (11.57) \\
&= \sum_x p_X(x) H(\rho_x). && (11.58)
\end{aligned}$$

The first equality follows from Definition 11.4.1. The second equality follows from
Theorem 11.2.2, and the final equality results from algebra.

The above form for conditional entropy is completely analogous to the classical formula
in (10.18) and holds whenever the conditioning system is classical.
11.4.2 Negative Conditional Quantum Entropy

One of the properties of the conditional quantum entropy in Definition 11.4.1 that seems
counterintuitive at first sight is that it can be negative. This negativity holds for an ebit
$|\Phi^+\rangle^{AB}$ shared between Alice and Bob. The marginal state on Bob's system is the
maximally mixed state $\pi^B$. Thus, the marginal entropy $H(B)$ is equal to one, but the joint
entropy vanishes, and so the conditional quantum entropy $H(A|B) = -1$.
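
The calculation for the ebit can be reproduced with a few lines of code. This is a minimal
sketch assuming NumPy; the function `conditional_entropy` is a hypothetical helper that
implements Definition 11.4.1 directly, and the maximally mixed two-qubit state is included
only as a contrasting example with positive conditional entropy.

```python
import numpy as np

def von_neumann(rho):
    """von Neumann entropy in bits, computed from the eigenvalues of rho."""
    ev = np.linalg.eigvalsh(rho)
    ev = ev[ev > 1e-15]
    return float(-np.sum(ev * np.log2(ev)))

def conditional_entropy(rho_ab, d_a, d_b):
    """H(A|B) = H(AB) - H(B) in bits, with rho_ab ordered as A (x) B."""
    t = rho_ab.reshape(d_a, d_b, d_a, d_b)
    rho_b = np.trace(t, axis1=0, axis2=2)   # partial trace over A
    return von_neumann(rho_ab) - von_neumann(rho_b)

# Ebit |Phi^+> = (|00> + |11>) / sqrt(2).
phi_plus = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
ebit = np.outer(phi_plus, phi_plus.conj())
h_ebit = conditional_entropy(ebit, 2, 2)

# Maximally mixed state on two qubits, for contrast.
mixed = np.eye(4) / 4
h_mixed = conditional_entropy(mixed, 2, 2)

print(h_ebit, h_mixed)
```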

What do we make of this result? Well, this is one of the fundamental differences between 
the classical world and the quantum world, and perhaps is the very essence of the departure 
from an informational standpoint. The informational statement is that we can sometimes be 
more certain about the joint state of a quantum system than we can be about any one of its 
individual parts, and this is the reason that conditional quantum entropy can be negative. 
This is in fact the same observation that Schrödinger made concerning entangled states [215]:

"When two systems, of which we know the states by their respective represen- 
tatives, enter into temporary physical interaction due to known forces between 
them, and when after a time of mutual influence the systems separate again, then 
they can no longer be described in the same way as before, viz. by endowing each 
of them with a representative of its own. I would not call that one but rather 
the characteristic trait of quantum mechanics, the one that enforces its entire 
departure from classical lines of thought. By the interaction the two represen- 
tatives [the quantum states] have become entangled. Another way of expressing 
the peculiar situation is: the best possible knowledge of a whole does not nec- 
essarily include the best possible knowledge of all its parts, even though they 
may be entirely separate and therefore virtually capable of being 'best possibly 
known,' i.e., of possessing, each of them, a representative of its own. The lack of 
knowledge is by no means due to the interaction being insufficiently known - 
at least not in the way that it could possibly be known more completely — it is 
due to the interaction itself." 

These explanations might aid somewhat in understanding a negative conditional entropy,
but the ultimate test for whether we truly understand an information measure is if it is the
answer to some operational task. The task where we can interpret the conditional quantum
entropy is known as state merging. Suppose that Alice and Bob share $n$ copies of a bipartite
state $\rho^{AB}$ where $n$ is a large number and $A$ and $B$ are qubit systems. We also allow them
free access to a classical side channel, but we count the number of times that they use a
noiseless qubit channel. Alice would like to send Bob qubits over a noiseless qubit channel
so that he receives her share of the state $\rho^{AB}$, i.e., so that he possesses all of the $A$
shares. The naive approach would be for Alice simply to send her shares of the state over the
noiseless qubit channels, i.e., she would use the channel $n$ times to send all $n$ shares. But
the state merging protocol allows her to do much better, depending on the state $\rho^{AB}$. If
the state $\rho^{AB}$ has positive conditional quantum entropy, she needs to use the noiseless
qubit channel only $nH(A|B)$ times (we will prove later that $H(A|B) \leq 1$ for any bipartite
state on qubit systems). If, however, the conditional quantum entropy is negative, she does not
need to use the noiseless qubit channel at all, and at the end of the protocol, Alice and
Bob share $n|H(A|B)|$ noiseless ebits! They can then use these ebits for future communication
purposes, such as a teleportation or super-dense coding protocol (see Chapter 6). Thus, a
negative conditional quantum entropy implies that Alice and Bob gain the potential for future
quantum communication, making clear in an operational sense what a negative conditional
quantum entropy means.² (We will cover this protocol in Chapter 21.)



Exercise 11.4.1 Show that $H(A|B)_\rho = H(A|BC)_\sigma$ if $\sigma^{ABC} = \rho^{AB} \otimes \tau^C$.

11.5 Coherent Information 

Negativity of the conditional quantum entropy is so important in quantum information
theory that we even have an information quantity and a special notation to denote the
negative of the conditional quantum entropy:

Definition 11.5.1 (Coherent Information). The coherent information $I(A\rangle B)_\rho$ of a
bipartite state $\rho^{AB}$ is as follows:

$$I(A\rangle B)_\rho \equiv H(B)_\rho - H(AB)_\rho. \tag{11.59}$$

You should immediately notice that this quantity is the negative of the conditional
quantum entropy in Definition 11.4.1, but it is perhaps more useful to think of the coherent
information not merely as the negative of the conditional quantum entropy, but as an
information quantity in its own right. This is why we employ a separate notation for it. The
"$I$" is present because the coherent information is an information quantity that measures
quantum correlations, much like the mutual information does in the classical case. For
example, we have already seen that the coherent information of an ebit is equal to one. Thus,
it is measuring the extent to which we know less about part of a system than we do about
its whole. Perhaps surprisingly, the coherent information obeys a quantum data processing
inequality (discussed in Section 11.9.3), which gives further support for it having an "$I$"
present in its notation. The Dirac symbol "$\rangle$" is present to indicate that this quantity is
a quantum information quantity, having meaning really only in the quantum world. The
choice of "$\rangle$" over "$\langle$" also indicates a directionality from Alice to Bob, and this
notation will make more sense when we begin to discuss the coherent information of a quantum
channel in Chapter 12.



Exercise 11.5.1 Calculate the coherent information $I(A\rangle B)_\Phi$ of the maximally
entangled state

$$|\Phi\rangle^{AB} \equiv \frac{1}{\sqrt{D}} \sum_{i=1}^{D} |i\rangle^A |i\rangle^B. \tag{11.60}$$

Calculate the coherent information $I(A\rangle B)_{\overline{\Phi}}$ of the maximally correlated
state

$$\overline{\Phi}^{AB} \equiv \frac{1}{D} \sum_{i=1}^{D} |i\rangle\langle i|^A \otimes |i\rangle\langle i|^B. \tag{11.61}$$

² After Horodecki, Oppenheim, and Winter published the state merging protocol [148], the Bristol
Evening Post featured a story about Andreas Winter with the amusing title "Scientist Knows Less
Than Nothing," as a reference to the potential negativity of conditional quantum entropy. Of
course, such a title may seem a bit nonsensical to the layman, but it does grasp the idea that we
can know less about a part of a quantum system than we do about its whole.

Exercise 11.5.2 Consider a bipartite state $\rho^{AB}$. Consider a purification of this state to
some environment system $E$. Show that

$$I(A\rangle B)_\rho = H(B)_\rho - H(E)_\rho. \tag{11.62}$$

Thus, there is a sense in which the coherent information measures the difference in the
uncertainty of Bob and the uncertainty of the environment.

Exercise 11.5.3 Show that $I(A\rangle B) = H(A|E)$ for the purification in the above exercise.
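
The two identities in the exercises above lend themselves to a numerical sanity check,
which is not a substitute for the proofs. The sketch below assumes NumPy; it draws a random
pure state on systems $A$, $B$, and $E$ (the dimensions are arbitrary), so that the state on
$ABE$ is a purification of its own marginal on $AB$, and evaluates the three quantities.

```python
import numpy as np

def von_neumann(rho):
    """von Neumann entropy in bits, computed from the eigenvalues of rho."""
    ev = np.linalg.eigvalsh(rho)
    ev = ev[ev > 1e-15]
    return float(-np.sum(ev * np.log2(ev)))

rng = np.random.default_rng(4)
d_a, d_b, d_e = 2, 3, 2

# psi[a, b, e] holds the amplitudes of a random pure state on A, B, E.
psi = rng.normal(size=(d_a, d_b, d_e)) + 1j * rng.normal(size=(d_a, d_b, d_e))
psi /= np.linalg.norm(psi)
c = psi.conj()

# Marginals obtained by contracting the traced-out indices.
m = psi.reshape(d_a * d_b, d_e)
rho_ab = m @ m.conj().T
rho_b = np.einsum("abe,ace->bc", psi, c)
rho_e = np.einsum("abe,abf->ef", psi, c)
rho_ae = np.einsum("abe,cbf->aecf", psi, c).reshape(d_a * d_e, d_a * d_e)

coherent = von_neumann(rho_b) - von_neumann(rho_ab)   # I(A>B) by Definition 11.5.1
via_env = von_neumann(rho_b) - von_neumann(rho_e)     # H(B) - H(E)
cond_ae = von_neumann(rho_ae) - von_neumann(rho_e)    # H(A|E)

print(coherent, via_env, cond_ae)
```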

The coherent information can be both negative and positive depending on the bipartite 
state on which we evaluate it, but it cannot be arbitrarily large. The following theorem 
places a useful bound on its absolute value. 

Theorem 11.5.1. Suppose that Alice and Bob share a bipartite state $\rho^{AB}$. The following
bound applies to the absolute value of the conditional entropy $H(A|B)$:

$$|H(A|B)| \leq \log d_A, \tag{11.63}$$

where $d_A$ is the dimension of Alice's system.

Proof. We first prove the inequality $H(A|B) \leq \log d_A$ in two steps:

$$\begin{aligned}
H(A|B) &\leq H(A) && (11.64) \\
&\leq \log d_A. && (11.65)
\end{aligned}$$

The first inequality follows because conditioning reduces entropy (Theorem 11.4.1), and the
second inequality follows because the maximum value of the entropy $H(A)$ is $\log d_A$. We
now prove the inequality $H(A|B) \geq -\log d_A$. Consider a purification $|\psi\rangle^{EAB}$ of
the state $\rho^{AB}$. We then have that

$$\begin{aligned}
H(A|B) &= -H(A|E) && (11.66) \\
&\geq -H(A) && (11.67) \\
&\geq -\log d_A. && (11.68)
\end{aligned}$$

The first equality follows from Exercise 11.5.3. The first and second inequalities follow by
the same reasons as the inequalities in the previous paragraph. □

Exercise 11.5.4 (Conditional Coherent Information) Consider a tripartite state $\rho^{ABC}$.
Show that

$$I(A\rangle BC)_\rho = I(A\rangle B|C)_\rho, \tag{11.69}$$

where $I(A\rangle B|C) \equiv H(B|C) - H(AB|C)$ is the conditional coherent information.

Exercise 11.5.5 (Conditional Coherent Information of a Classical-Quantum State)
Suppose we have a classical-quantum state $\sigma^{XAB}$ where

$$\sigma^{XAB} = \sum_x p_X(x) \, |x\rangle\langle x|^X \otimes \sigma_x^{AB}. \tag{11.70}$$

Show that

$$I(A\rangle BX)_\sigma = \sum_x p_X(x) \, I(A\rangle B)_{\sigma_x}. \tag{11.71}$$

11.6 Quantum Mutual Information 

The standard informational measure of correlations in the classical world is the mutual 
information, and such a quantity plays a prominent role in measuring classical and quantum 
correlations in the quantum world as well. 

Definition 11.6.1 (Quantum Mutual Information). The quantum mutual information of a
bipartite state $\rho^{AB}$ is as follows:

$$I(A;B)_\rho \equiv H(A)_\rho + H(B)_\rho - H(AB)_\rho. \tag{11.72}$$

The following relations hold for quantum mutual information, in analogy with the
classical case:

$$\begin{aligned}
I(A;B)_\rho &= H(A)_\rho - H(A|B)_\rho && (11.73) \\
&= H(B)_\rho - H(B|A)_\rho. && (11.74)
\end{aligned}$$

These immediately lead to the following relations between quantum mutual information and
the coherent information:

$$\begin{aligned}
I(A;B)_\rho &= H(A)_\rho + I(A\rangle B)_\rho && (11.75) \\
&= H(B)_\rho + I(B\rangle A)_\rho. && (11.76)
\end{aligned}$$

The theorem below gives a fundamental lower bound on the quantum mutual information.
We merely state it for now and give a full proof later.

Theorem 11.6.1 (Positivity of Quantum Mutual Information). The quantum mutual
information $I(A;B)_\rho$ of any bipartite quantum state $\rho^{AB}$ is positive:

$$I(A;B)_\rho \geq 0. \tag{11.77}$$
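
Definition 11.6.1 translates directly into code. The sketch below assumes NumPy and uses
hypothetical helpers (`mutual_information`, `random_density`); it evaluates $I(A;B)$ on a
batch of random mixed states and on one product state. Sampling cannot prove Theorem 11.6.1,
but it shows the bound holding on every instance tried.

```python
import numpy as np

def von_neumann(rho):
    """von Neumann entropy in bits, computed from the eigenvalues of rho."""
    ev = np.linalg.eigvalsh(rho)
    ev = ev[ev > 1e-15]
    return float(-np.sum(ev * np.log2(ev)))

def mutual_information(rho_ab, d_a, d_b):
    """I(A;B) = H(A) + H(B) - H(AB) in bits, with rho_ab ordered as A (x) B."""
    t = rho_ab.reshape(d_a, d_b, d_a, d_b)
    rho_a = np.trace(t, axis1=1, axis2=3)   # partial trace over B
    rho_b = np.trace(t, axis1=0, axis2=2)   # partial trace over A
    return von_neumann(rho_a) + von_neumann(rho_b) - von_neumann(rho_ab)

def random_density(d, rng):
    """Hypothetical helper: a random full-rank density operator of dimension d."""
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

rng = np.random.default_rng(5)
d_a, d_b = 2, 3

# Mutual information of random (generally correlated) mixed states.
samples = [mutual_information(random_density(d_a * d_b, rng), d_a, d_b)
           for _ in range(100)]

# A product state has no correlations, so its mutual information should vanish.
product = np.kron(random_density(d_a, rng), random_density(d_b, rng))
i_product = mutual_information(product, d_a, d_b)

print(min(samples), max(samples), i_product)
```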

Exercise 11.6.1 (Proof that conditioning does not increase entropy) Show that
positivity of quantum mutual information implies that conditioning does not increase entropy
(Theorem 11.4.1).
Exercise 11.6.2 Calculate the quantum mutual information $I(A;B)_\Phi$ of the maximally
entangled state $\Phi^{AB}$. Calculate the quantum mutual information
$I(A;B)_{\overline{\Phi}}$ of the maximally correlated state $\overline{\Phi}^{AB}$.

Exercise 11.6.3 (Bound on Quantum Mutual Information) Prove that the following
bound applies to the quantum mutual information:

$$I(A;B) \leq 2 \min\{\log d_A, \log d_B\}, \tag{11.78}$$

where $d_A$ is the dimension of system $A$ and $d_B$ is the dimension of system $B$.

Exercise 11.6.4 Consider a pure state $|\psi\rangle^{RA}$. Suppose that an isometry
$U^{A \to BE}$ acts on the $A$ system to produce the state $|\phi\rangle^{RBE}$. Show that

$$I(R;B)_\phi + I(R;E)_\phi = I(R;A)_\psi. \tag{11.79}$$

Exercise 11.6.5 Consider a tripartite state $|\psi\rangle^{SRA}$. Suppose that an isometry
$U^{A \to BE}$ acts on the $A$ system to produce the state $|\phi\rangle^{SRBE}$. Show that

$$I(R;A)_\psi + I(R;S)_\psi = I(R;B)_\phi + I(R;SE)_\phi. \tag{11.80}$$

Exercise 11.6.6 (Entropy, Coherent Information, and Quantum Mutual Information)
Consider a pure state $|\phi\rangle^{ABE}$ on systems $ABE$. Using the Schmidt decomposition
with respect to the bipartite cut $A|BE$, we can write $|\phi\rangle^{ABE}$ as follows:

$$|\phi\rangle^{ABE} = \sum_x \sqrt{p_X(x)} \, |x\rangle^A \otimes |\phi_x\rangle^{BE}, \tag{11.81}$$

for some orthonormal states $\{|x\rangle^A\}_{x \in \mathcal{X}}$ on system $A$ and some orthonormal
states $\{|\phi_x\rangle^{BE}\}$ on the joint system $BE$. Prove the following relations:

$$I(A\rangle B)_\phi = \tfrac{1}{2} I(A;B)_\phi - \tfrac{1}{2} I(A;E)_\phi, \tag{11.82}$$

$$H(A)_\phi = \tfrac{1}{2} I(A;B)_\phi + \tfrac{1}{2} I(A;E)_\phi. \tag{11.83}$$

Exercise 11.6.7 (Coherent Information and Private Information) We obtain a decohered
version $\overline{\phi}^{XBE}$ of the state in Exercise 11.6.6 by measuring the $A$ system in the
basis $\{|x\rangle^A\}_{x \in \mathcal{X}}$. Let us now denote the $A$ system as the $X$ system
because it becomes a classical system after the measurement:

$$\overline{\phi}^{XBE} = \sum_x p_X(x) \, |x\rangle\langle x|^X \otimes \phi_x^{BE}. \tag{11.84}$$

Prove the following relation:

$$I(A\rangle B)_\phi = I(X;B)_{\overline{\phi}} - I(X;E)_{\overline{\phi}}. \tag{11.85}$$

The quantity on the RHS is known as the private information, because there is a sense in
which it quantifies the classical information in $X$ that is accessible to Bob while being
private from Eve.
11.6.1 Holevo Information

Suppose that Alice prepares some classical ensemble $\mathcal{E} \equiv \{p_X(x), \rho_x^B\}$ and
then hands this ensemble to Bob without telling him the classical index $x$. The expected
density operator of this ensemble is

$$\rho^B \equiv \mathbb{E}_X\{\rho_X^B\} = \sum_x p_X(x) \rho_x^B, \tag{11.86}$$

and this density operator $\rho^B$ characterizes the state from Bob's perspective because he does
not have knowledge of the classical index $x$. His task is to determine the classical index $x$ by
performing some measurement on his system $B$. Recall from Section 10.8.2 that the accessible
information quantifies Bob's information gain after performing some optimal measurement
$\{\Lambda_y\}$ on his system $B$:

$$I_{\mathrm{acc}}(\mathcal{E}) = \max_{\{\Lambda_y\}} I(X;Y), \tag{11.87}$$

where $Y$ is a random variable corresponding to the outcome of the measurement.

What is the accessible information of the ensemble? In general, this quantity is difficult
to compute, but another quantity, called the Holevo information, provides a useful upper
bound. The Holevo information $\chi(\mathcal{E})$ of the ensemble is

$$\chi(\mathcal{E}) \equiv H(\rho^B) - \sum_x p_X(x) H(\rho_x^B). \tag{11.88}$$

Exercise 11.9.1 asks you to prove this upper bound after we develop the quantum data
processing inequality for quantum mutual information. The Holevo information characterizes
the correlations between the classical variable $X$ and the quantum system $B$.
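
To make (11.88) concrete, the sketch below (assuming NumPy) evaluates the Holevo information
of a simple ensemble of our choosing, the states $|0\rangle$ and $|+\rangle$ with equal
probability, and compares it with the classical mutual information obtained from one
particular measurement, the computational basis. That measurement is not claimed to be
optimal, so the second number is only a lower bound on the accessible information.

```python
import numpy as np

def shannon(p):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 1e-15]
    return float(-np.sum(p * np.log2(p)))

def von_neumann(rho):
    """von Neumann entropy in bits, computed from the eigenvalues of rho."""
    return shannon(np.linalg.eigvalsh(rho))

# Illustrative ensemble: |0> and |+> with equal probability.
ket0 = np.array([1.0, 0.0])
ket_plus = np.array([1.0, 1.0]) / np.sqrt(2)
p_x = np.array([0.5, 0.5])
states = [np.outer(k, k.conj()) for k in (ket0, ket_plus)]

# Holevo information: H(rho^B) - sum_x p_X(x) H(rho_x^B).
rho_b = sum(p * s for p, s in zip(p_x, states))
chi = von_neumann(rho_b) - sum(p * von_neumann(s) for p, s in zip(p_x, states))

# Joint distribution p(x, y) = p_X(x) Tr{Lambda_y rho_x} for a computational-basis measurement.
povm = [np.diag([1.0, 0.0]), np.diag([0.0, 1.0])]
joint = np.array([[p * np.trace(lam @ s).real for lam in povm]
                  for p, s in zip(p_x, states)])

# Classical mutual information I(X;Y) = H(X) + H(Y) - H(XY).
i_xy = shannon(joint.sum(axis=1)) + shannon(joint.sum(axis=0)) - shannon(joint.ravel())

print(chi, i_xy)
```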

Exercise 11.6.8 (Quantum Mutual Information of Classical-Quantum States)
Consider the following classical-quantum state representing the ensemble $\mathcal{E}$:

$$\sigma^{XB} \equiv \sum_x p_X(x) \, |x\rangle\langle x|^X \otimes \rho_x^B. \tag{11.89}$$

Show that the Holevo information $\chi(\mathcal{E})$ is equivalent to the mutual information
$I(X;B)_\sigma$:

$$\chi(\mathcal{E}) = I(X;B)_\sigma. \tag{11.90}$$

In this sense, the quantum mutual information of a classical-quantum state is most similar
to the classical mutual information of Shannon.

Exercise 11.6.9 (Concavity of Quantum Entropy) Prove the concavity of entropy
(Property 11.1.4) using Theorem 11.6.1 and the result of Exercise 11.6.8.

Exercise 11.6.10 Prove that the following bound applies to the Holevo information:

$$I(X;B)_\sigma \leq \log d_X, \tag{11.91}$$

where $d_X$ is the dimension of the random variable $X$ and the quantum mutual information
is with respect to the classical-quantum state in Exercise 11.6.8.
11.7 Conditional Quantum Mutual Information

We define the conditional quantum mutual information $I(A;B|C)_\rho$ of any tripartite state
$\rho^{ABC}$ similarly to how we did in the classical case:

$$I(A;B|C)_\rho \equiv H(A|C)_\rho + H(B|C)_\rho - H(AB|C)_\rho. \tag{11.92}$$

One can exploit the above definition and the definition of quantum mutual information to
prove a chain rule for quantum mutual information.

Property 11.7.1 (Chain Rule for Quantum Mutual Information) The quantum mutual
information obeys a chain rule:

$$I(AB;C) = I(B;C|A) + I(A;C). \tag{11.93}$$

Exercise 11.7.1 Use the chain rule for quantum mutual information to prove the following
relationship:

$$I(A;BC) = I(AC;B) + I(A;C) - I(B;C). \tag{11.94}$$

11.7.1 Positivity of Conditional Quantum Mutual Information 

In the classical world, positivity of conditional mutual information follows trivially from pos- 



itivity of mutual information (recall Theorem 10.6.1 ). The proof of positivity of conditional 



quantum mutual information is far from trivial in the quantum world, unless the condition- 



ing system is classical (see Exercise 11.7.2). It is a wonderful thing that positivity of this 
quantity holds because so much of quantum information theory rests upon this theorem's 
shoulders (in fact, we could say that this inequality is one of the "bedrocks" of quantum in- 
formation theory). The list of its corollaries includes the quantum data processing inequality, 
the answers to some additivity questions in quantum Shannon theory, the Holevo bound, 



and others. The proof of Theorem 11.7.1 follows directly from monotonicity of quantum 



relative entropy (Theorem 11.9.1), which we prove partially in the proof of Theorem 11.9.1 



and fully in Appendix [B] of this book. 

Theorem 11.7.1 (Positivity of Conditional Quantum Mutual Information). Suppose we 
have a quantum state on three systems A, B, and C . Then the conditional quantum mutual 
information is positive: 

$$ I(A;B|C) \geq 0. \tag{11.95} $$



This condition is equivalent to the strong subadditivity inequality in Exercise 11.7.6, so we 
might also refer to the above inequality as strong subadditivity. 
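
As a numerical sanity check of Theorem 11.7.1 (not a proof), the following sketch samples a random three-qubit state and evaluates $I(A;B|C)$ together with the strong subadditivity gap of (11.101). It assumes NumPy; the helper names `entropy`, `random_state`, and `partial_trace` are illustrative choices made here, and the sampling method for the random state is arbitrary.

```python
import numpy as np

def entropy(rho):
    """Von Neumann entropy H(rho) in bits."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]          # drop numerically zero eigenvalues
    return float(-np.sum(evals * np.log2(evals)))

def random_state(dim, rng):
    """Random full-rank density operator (an arbitrary sampling choice)."""
    g = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

def partial_trace(rho, dims, keep):
    """Reduced state on the subsystems listed (in increasing order) in keep."""
    t = rho.reshape(dims + dims)
    # Trace out unwanted subsystems, highest index first, so that the
    # positions of the remaining row axes do not shift.
    for i in sorted(set(range(len(dims))) - set(keep), reverse=True):
        t = np.trace(t, axis1=i, axis2=i + t.ndim // 2)
    d = int(np.prod([dims[i] for i in keep]))
    return t.reshape(d, d)

rng = np.random.default_rng(0)
dims = [2, 2, 2]                          # A, B, and C are qubits here
rho_abc = random_state(8, rng)
A, B, C = 0, 1, 2

def H(keep):
    return entropy(partial_trace(rho_abc, dims, keep))

# I(A;B|C) = H(A|C) + H(B|C) - H(AB|C) = H(AC) + H(BC) - H(C) - H(ABC)
cmi = H([A, C]) + H([B, C]) - H([C]) - H([A, B, C])
# Gap in H(AB) + H(BC) >= H(B) + H(ABC)
ssa_gap = H([A, B]) + H([B, C]) - H([B]) - H([A, B, C])

print("I(A;B|C) =", cmi)
print("H(AB) + H(BC) - H(B) - H(ABC) =", ssa_gap)
```

Both printed quantities should be non-negative up to floating-point error for any input state.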

Exercise 11.7.2 (Conditional Quantum Mutual Information of Classical-Quantum 
States) Consider a classical-quantum state $\sigma^{XAB}$ of the form in (11.70). Prove the 
following relation: 

$$ I(A;B|X)_\sigma = \sum_x p_X(x) I(A;B)_{\sigma_x}. \tag{11.96} $$

Conclude that positivity of conditional quantum mutual information is trivial in this special 
case where the conditioning system is classical, simply by exploiting positivity of quantum 
mutual information (Theorem 11.6.1). 

Exercise 11.7.3 (Conditioning Does Not Increase Entropy) Consider a tripartite 
state $\rho^{ABC}$. Show that Theorem 11.7.1 implies the following stronger form of 
Theorem 11.4.1: 

$$ H(A|B)_\rho \geq H(A|BC)_\rho. \tag{11.97} $$

Exercise 11.7.4 (Concavity of Conditional Quantum Entropy) Show that strong 
subadditivity implies that conditional entropy is concave. That is, prove that 

$$ \sum_x p_X(x) H(A|B)_{\rho_x} \leq H(A|B)_\rho, \tag{11.98} $$

where $\rho^{AB} \equiv \sum_x p_X(x) \rho_x^{AB}$. 

Exercise 11.7.5 (Convexity of Coherent Information) Prove that coherent information 
is convex: 

$$ \sum_x p_X(x) I(A\rangle B)_{\rho_x} \geq I(A\rangle B)_\rho, \tag{11.99} $$

by exploiting the result of the above exercise. 



Exercise 11.7.6 (Strong Subadditivity) Theorem 11.7.1 also goes by the name of "strong 
subadditivity" because it is an example of a function $\phi$ that is strongly subadditive: 

$$ \phi(E) + \phi(F) \geq \phi(E \cap F) + \phi(E \cup F). \tag{11.100} $$

Show that positivity of quantum conditional mutual information implies the following strong 
subadditivity relation: 

$$ H(AB) + H(BC) \geq H(B) + H(ABC), \tag{11.101} $$

where we think of $\phi$ in (11.100) as the entropy function $H$, the argument $E$ in (11.100) as 
$AB$, and the argument $F$ in (11.100) as $BC$. 



11.8 Quantum Relative Entropy 



The quantum relative entropy $D(\rho \| \sigma)$ between two states $\rho$ and $\sigma$ is as follows: 

$$ D(\rho \| \sigma) \equiv \mathrm{Tr}\{\rho(\log(\rho) - \log(\sigma))\}. \tag{11.102} $$

Similar to the classical case, we can intuitively think of it as a distance measure between 
quantum states. But it is not strictly a distance measure in the mathematical sense because 
it is not symmetric and does not obey a triangle inequality. Nevertheless, the quantum 
relative entropy is always non-negative. 

Theorem 11.8.1 (Positivity of Quantum Relative Entropy). The relative entropy $D(\rho \| \sigma)$ 
is positive (that is, non-negative) for any two density operators $\rho$ and $\sigma$: 

$$ D(\rho \| \sigma) \geq 0. \tag{11.103} $$

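Proof. Consider a spectral decomposition for $\rho$ and $\sigma$: 

$$ \rho = \sum_x p(x) |\phi_x\rangle\langle\phi_x|, \qquad \sigma = \sum_y q(y) |\psi_y\rangle\langle\psi_y|, \tag{11.104} $$

where $\{|\phi_x\rangle\}$ and $\{|\psi_y\rangle\}$ are generally different orthonormal bases. Then we 
explicitly evaluate the formula in (11.102) for the quantum relative entropy: 

\begin{align}
D(\rho \| \sigma) &= \mathrm{Tr}\Big\{ \sum_x p(x) |\phi_x\rangle\langle\phi_x| \log\Big( \sum_{x'} p(x') |\phi_{x'}\rangle\langle\phi_{x'}| \Big) \Big\} \nonumber \\
&\quad - \mathrm{Tr}\Big\{ \sum_x p(x) |\phi_x\rangle\langle\phi_x| \log\Big( \sum_y q(y) |\psi_y\rangle\langle\psi_y| \Big) \Big\} \tag{11.105} \\
&= \mathrm{Tr}\Big\{ \sum_x p(x) |\phi_x\rangle\langle\phi_x| \sum_{x'} \log(p(x')) |\phi_{x'}\rangle\langle\phi_{x'}| \Big\} \nonumber \\
&\quad - \mathrm{Tr}\Big\{ \sum_x p(x) |\phi_x\rangle\langle\phi_x| \sum_y \log(q(y)) |\psi_y\rangle\langle\psi_y| \Big\} \tag{11.106} \\
&= \sum_x p(x) \log(p(x)) - \sum_x p(x) \sum_y |\langle\phi_x|\psi_y\rangle|^2 \log(q(y)) \tag{11.107} \\
&\geq \sum_x p(x) \log(p(x)) - \sum_x p(x) \log(r(x)) \tag{11.108} \\
&= \sum_x p(x) \log\Big( \frac{p(x)}{r(x)} \Big) \tag{11.109} \\
&\geq 0. \tag{11.110}
\end{align}

The first equality follows by a direct substitution. The second equality follows because 
$f(A) = \sum_i f(a_i) |i\rangle\langle i|$ for any Hermitian operator $A$ with spectral decomposition 
$\sum_i a_i |i\rangle\langle i|$. The third equality follows by evaluating the trace. Note that the 
quantity $|\langle\phi_x|\psi_y\rangle|^2$ sums to one if we sum either over $x$ or over $y$; thus, we can 
think of it either as a conditional distribution $p(x|y)$ or $p(y|x)$, respectively. The first 
inequality follows by viewing the probabilities $|\langle\phi_x|\psi_y\rangle|^2$ as conditional 
probabilities $p(y|x)$, by noting that $-\log(z)$ is a convex function of $z$, and by defining 
$r(x) \equiv \sum_y |\langle\phi_x|\psi_y\rangle|^2 q(y)$. Note that $r(x)$ is a probability distribution 
because we can think of $|\langle\phi_x|\psi_y\rangle|^2$ as conditional probabilities $p(x|y)$. The fourth 
equality follows by collecting the logarithms, and the last inequality follows because the 
classical relative entropy is positive (Theorem 10.7.1). □ 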
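
The third equality in the proof, (11.107), also gives a convenient way to evaluate the quantum relative entropy numerically. The sketch below assumes NumPy and logarithms base two; the function name `relative_entropy`, the tolerance `tol`, and the random sampling are illustrative choices made here, not part of the text.

```python
import numpy as np

def random_state(dim, rng):
    """Random full-rank density operator (an arbitrary sampling choice)."""
    g = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

def relative_entropy(rho, sigma, tol=1e-12):
    """D(rho||sigma) = Tr{rho (log rho - log sigma)} in bits, via (11.107)."""
    p = np.linalg.eigvalsh(rho)
    p = p[p > tol]
    q, psi = np.linalg.eigh(sigma)        # columns of psi are eigenvectors
    # weights[y] = <psi_y| rho |psi_y> = sum_x p(x) |<phi_x|psi_y>|^2
    weights = np.real(np.einsum('iy,ij,jy->y', psi.conj(), rho, psi))
    # rho has weight outside the support of sigma: the quantity is infinite
    if np.any((q <= tol) & (weights > tol)):
        return float('inf')
    keep = q > tol
    return float(np.sum(p * np.log2(p))
                 - np.sum(weights[keep] * np.log2(q[keep])))

rng = np.random.default_rng(1)
rho, sigma = random_state(2, rng), random_state(2, rng)

d_rs = relative_entropy(rho, sigma)
d_sr = relative_entropy(sigma, rho)
d_rr = relative_entropy(rho, rho)

print("D(rho||sigma) =", d_rs)
print("D(sigma||rho) =", d_sr)   # generally differs: D is not symmetric
print("D(rho||rho)   =", d_rr)   # vanishes up to rounding
```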
Corollary 11.8.1 (Subadditivity of Quantum Entropy). The von Neumann entropy is 
subadditive for a bipartite state $\rho^{AB}$: 

$$ H(A)_\rho + H(B)_\rho \geq H(AB)_\rho. \tag{11.111} $$

Proof. Subadditivity of entropy is equivalent to positivity of quantum mutual information. 
We can prove positivity by exploiting the result of Exercise 11.8.2 and positivity of quantum 
relative entropy. □ 

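The quantum relative entropy can sometimes be infinite. We consider a simple qubit 
example to illustrate this property and then generalize it from there. Suppose we would like 
to determine the quantum relative entropy between a pure state $|\psi\rangle$ and a state $\sigma$ that is 
the following mixture: 

$$ \sigma \equiv \epsilon |\psi\rangle\langle\psi| + (1-\epsilon) |\psi^\perp\rangle\langle\psi^\perp|. \tag{11.112} $$

The states $|\psi\rangle$ and $\sigma$ are $\epsilon$-away from being orthogonal to each other, in the sense that: 

$$ \langle\psi|\sigma|\psi\rangle = \epsilon. \tag{11.113} $$

Then they are approximately distinguishable by a measurement and we would expect the 
relative entropy between them to be quite high. We calculate $D(|\psi\rangle \| \sigma)$: 

\begin{align}
D(|\psi\rangle \| \sigma) &= -H(|\psi\rangle) - \mathrm{Tr}\{|\psi\rangle\langle\psi| \log \sigma\} \tag{11.114} \\
&= -\mathrm{Tr}\{|\psi\rangle\langle\psi| (\log\epsilon \, |\psi\rangle\langle\psi| + \log(1-\epsilon) |\psi^\perp\rangle\langle\psi^\perp|)\} \tag{11.115} \\
&= -\log\epsilon. \tag{11.116}
\end{align}

Then, the quantum relative entropy can become infinite in the limit as $\epsilon \to 0$ because 

$$ \lim_{\epsilon \to 0} -\log\epsilon = +\infty. \tag{11.117} $$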
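
The divergence can be seen numerically. The following sketch assumes NumPy, takes $|\psi\rangle = |0\rangle$ and $|\psi^\perp\rangle = |1\rangle$ for concreteness, and reuses an illustrative helper `relative_entropy` (a name chosen here) that returns infinity when the first argument has weight outside the support of the second.

```python
import numpy as np

def relative_entropy(rho, sigma, tol=1e-12):
    """D(rho||sigma) in bits; returns inf if supp(rho) leaves supp(sigma)."""
    p = np.linalg.eigvalsh(rho)
    p = p[p > tol]
    q, psi = np.linalg.eigh(sigma)
    weights = np.real(np.einsum('iy,ij,jy->y', psi.conj(), rho, psi))
    if np.any((q <= tol) & (weights > tol)):
        return float('inf')
    keep = q > tol
    return float(np.sum(p * np.log2(p))
                 - np.sum(weights[keep] * np.log2(q[keep])))

ket_psi = np.array([1.0, 0.0])        # |psi>, chosen as |0> for illustration
ket_perp = np.array([0.0, 1.0])       # |psi_perp>
rho = np.outer(ket_psi, ket_psi)      # the pure state |psi><psi|

def sigma_eps(eps):
    """The mixture in (11.112)."""
    return eps * np.outer(ket_psi, ket_psi) + (1 - eps) * np.outer(ket_perp, ket_perp)

# D(|psi> || sigma) should equal -log2(eps) and grow without bound as eps -> 0.
values = {eps: relative_entropy(rho, sigma_eps(eps)) for eps in (1e-1, 1e-3, 1e-6)}
at_zero = relative_entropy(rho, sigma_eps(0.0))

for eps, val in values.items():
    print(f"eps = {eps:g}: D = {val:.4f}, -log2(eps) = {-np.log2(eps):.4f}")
print("eps = 0:", at_zero)
```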

We generalize the above example with the following property of quantum relative entropy: 

Property 11.8.1 (Infinite Quantum Relative Entropy) Suppose the support of $\rho$ and 
the orthogonal support of $\sigma$ have non-trivial intersection: 

$$ \mathrm{supp}(\rho) \cap \mathrm{supp}(\sigma)^\perp \neq \emptyset. \tag{11.118} $$

Then the quantum relative entropy is infinite: 

$$ D(\rho \| \sigma) = +\infty. \tag{11.119} $$

We can prove this property by a simple generalization of the above "qubit" argument 
where we consider the following mixture and take the limit of the quantum relative entropy 
as $\epsilon \to 0$: 

$$ \epsilon\sigma + (1-\epsilon)\sigma^\perp, \tag{11.120} $$

where $\sigma^\perp$ is a strictly positive density operator that lives on the orthogonal support of $\sigma$. 

We give some intuition for the latter condition in the above property. It occurs in the 
case where two states have orthogonal support, and there is always a measurement that can 
perfectly distinguish the two states in this case. 
Exercise 11.8.1 Show that the following identity holds: 

$$ \log(\rho^A \otimes \rho^B) = \log(\rho^A) \otimes I^B + I^A \otimes \log(\rho^B). \tag{11.121} $$

Exercise 11.8.2 Show that the following identity holds: 

$$ D(\rho^{AB} \| \rho^A \otimes \rho^B) = I(A;B)_{\rho^{AB}}. \tag{11.122} $$

Exercise 11.8.3 Show that the following identity holds: 

$$ D(\rho^{AB} \| I^A \otimes \rho^B) = -H(A|B). $$

Exercise 11.8.4 Show that the relative entropy is invariant under unitary operations: 

$$ D(\rho \| \sigma) = D(U\rho U^\dagger \| U\sigma U^\dagger). \tag{11.123} $$

Exercise 11.8.5 (Additivity of Quantum Relative Entropy) Show that the quantum 
relative entropy is additive for tensor product states: 

$$ D(\rho_1 \otimes \rho_2 \| \sigma_1 \otimes \sigma_2) = D(\rho_1 \| \sigma_1) + D(\rho_2 \| \sigma_2). \tag{11.124} $$

Apply the above additivity relation inductively to conclude that 

$$ D(\rho^{\otimes n} \| \sigma^{\otimes n}) = n D(\rho \| \sigma). \tag{11.125} $$

Exercise 11.8.6 (Quantum Relative Entropy of Classical-Quantum States) Show 
that the quantum relative entropy between classical-quantum states $\rho^{XB}$ and $\sigma^{XB}$ is as 
follows: 

$$ D(\rho^{XB} \| \sigma^{XB}) = \sum_x p_X(x) D(\rho_x \| \sigma_x), \tag{11.126} $$

where 

$$ \rho^{XB} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \rho_x^B, \qquad \sigma^{XB} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \sigma_x^B. \tag{11.127} $$

11.9 Quantum Information Inequalities 

11.9.1 The Fundamental Quantum Information Inequality 

The most fundamental information inequality in quantum information theory is the mono- 
tonicity of quantum relative entropy. The physical interpretation of this inequality is that 
states become less distinguishable when noise acts on them. 

Theorem 11.9.1 (Monotonicity of Quantum Relative Entropy). The quantum relative 
entropy between two states $\rho$ and $\sigma$ can only decrease if we apply the same noisy map 
$\mathcal{N}$ to each state: 

$$ D(\rho \| \sigma) \geq D(\mathcal{N}(\rho) \| \mathcal{N}(\sigma)). \tag{11.128} $$
Proof. We can realize any noisy map $\mathcal{N}$ by appending a state $|0\rangle^E$ to the system, applying 
some unitary $U$ on the larger Hilbert space, and tracing out the environment system. With 
this in mind, consider the following chain of inequalities: 

\begin{align}
D(\rho \| \sigma) &= D(\rho \| \sigma) + D(|0\rangle\langle 0|^E \| |0\rangle\langle 0|^E) \tag{11.129} \\
&= D(\rho \otimes |0\rangle\langle 0|^E \| \sigma \otimes |0\rangle\langle 0|^E) \tag{11.130} \\
&= D(U(\rho \otimes |0\rangle\langle 0|^E)U^\dagger \| U(\sigma \otimes |0\rangle\langle 0|^E)U^\dagger) \tag{11.131} \\
&\geq D(\mathcal{N}(\rho) \| \mathcal{N}(\sigma)). \tag{11.132}
\end{align}

The first equality follows because the quantum relative entropy 

$$ D(|0\rangle\langle 0|^E \| |0\rangle\langle 0|^E) \tag{11.133} $$

vanishes. The second equality follows from additivity of quantum relative entropy over tensor 
product states (see Exercise 11.8.5). The third equality follows because the quantum relative 
entropy is invariant under unitaries (see Exercise 11.8.4). The last inequality follows from 
the following simpler form of monotonicity: 

$$ D(\rho^{AB} \| \sigma^{AB}) \geq D(\rho^A \| \sigma^A). \tag{11.134} $$

The proof of (11.134) is rather involved, exploiting ideas from operator convex functions, 
and we prove it in full in Appendix B. □ 
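
Monotonicity can be spot-checked numerically for two simple noisy maps: the partial trace in (11.134) and a depolarizing channel. This is a sketch under stated assumptions: it uses NumPy, two-qubit states, and the illustrative helpers `relative_entropy`, `trace_out_b`, and `depolarize` (names chosen here); the depolarizing parameter 0.3 is arbitrary.

```python
import numpy as np

def random_state(dim, rng):
    """Random full-rank density operator (an arbitrary sampling choice)."""
    g = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

def relative_entropy(rho, sigma, tol=1e-12):
    """D(rho||sigma) in bits."""
    p = np.linalg.eigvalsh(rho)
    p = p[p > tol]
    q, psi = np.linalg.eigh(sigma)
    weights = np.real(np.einsum('iy,ij,jy->y', psi.conj(), rho, psi))
    if np.any((q <= tol) & (weights > tol)):
        return float('inf')
    keep = q > tol
    return float(np.sum(p * np.log2(p))
                 - np.sum(weights[keep] * np.log2(q[keep])))

def trace_out_b(rho_ab, da, db):
    """Partial trace over the second subsystem."""
    return np.einsum('ibjb->ij', rho_ab.reshape(da, db, da, db))

def depolarize(rho, p):
    """Depolarizing channel: keep rho with weight 1-p, else output I/d."""
    d = rho.shape[0]
    return (1 - p) * rho + p * np.trace(rho) * np.eye(d) / d

rng = np.random.default_rng(2)
rho_ab, sigma_ab = random_state(4, rng), random_state(4, rng)

d_joint = relative_entropy(rho_ab, sigma_ab)
d_marginal = relative_entropy(trace_out_b(rho_ab, 2, 2), trace_out_b(sigma_ab, 2, 2))
d_noisy = relative_entropy(depolarize(rho_ab, 0.3), depolarize(sigma_ab, 0.3))

print("D(rho^AB || sigma^AB)   =", d_joint)
print("D(rho^A  || sigma^A)    =", d_marginal)   # no larger than the first line
print("D(N(rho) || N(sigma))   =", d_noisy)      # no larger than the first line
```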



11.9.2 Corollaries of the Fundamental Quantum Information Inequality 

Monotonicity of quantum relative entropy has as its corollaries many of the important in- 
formation inequalities in quantum information theory. 

Corollary 11.9.1 (Strong Subadditivity). The von Neumann entropy for any tripartite state 
$\rho^{ABC}$ is strongly subadditive: 

$$ H(AB)_\rho + H(BC)_\rho \geq H(ABC)_\rho + H(B)_\rho. \tag{11.135} $$

Proof. Consider that 

$$ D(\rho^{ABC} \| \rho^A \otimes \rho^{BC}) = I(A;BC)_\rho. $$

This equality follows from the result of Exercise 11.8.2. A similar relation applies for 
the state $\rho^{AB}$: 

$$ D(\rho^{AB} \| \rho^A \otimes \rho^B) = I(A;B)_\rho. \tag{11.136} $$

Then 

\begin{align}
D(\rho^{ABC} \| \rho^A \otimes \rho^{BC}) &\geq D(\rho^{AB} \| \rho^A \otimes \rho^B) \tag{11.137} \\
I(A;BC)_\rho &\geq I(A;B)_\rho \tag{11.138} \\
I(A;C|B)_\rho &\geq 0. \tag{11.139}
\end{align}

The first line follows from monotonicity of quantum relative entropy (tracing out the C 
system). The second line follows by using the above results. The final line follows from the 
chain rule for quantum mutual information, $I(A;BC) = I(A;B) + I(A;C|B)$. The last line is 
positivity of conditional quantum mutual information with the roles of $B$ and $C$ exchanged, 
and it is equivalent to the statement of strong subadditivity by the result of Exercise 11.7.6. 
Thus, strong subadditivity follows from monotonicity of quantum relative entropy. □ 

Corollary 11.9.2 (Joint Convexity of Quantum Relative Entropy). The quantum relative 
entropy is jointly convex in its arguments: 

$$ D(\rho \| \sigma) \leq \sum_x p_X(x) D(\rho_x \| \sigma_x), \tag{11.140} $$

where $\rho \equiv \sum_x p_X(x) \rho_x$ and $\sigma \equiv \sum_x p_X(x) \sigma_x$. 

Proof. Consider classical-quantum states of the following form: 

\begin{align}
\rho^{XB} &\equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \rho_x^B, \tag{11.141} \\
\sigma^{XB} &\equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \sigma_x^B. \tag{11.142}
\end{align}

Then the following chain of inequalities holds: 

\begin{align}
\sum_x p_X(x) D(\rho_x \| \sigma_x) &= D(\rho^{XB} \| \sigma^{XB}) \tag{11.143} \\
&\geq D(\rho^B \| \sigma^B). \tag{11.144}
\end{align}

The first equality follows from the result of Exercise 11.8.6, and the inequality follows from 
monotonicity of quantum relative entropy (tracing out the $X$ system). □ 

Corollary 11.9.3 (Complete dephasing increases entropy). Suppose that we completely 
dephase a density operator $\rho$ with respect to some dephasing basis $\{|y\rangle\}$. Let $\sigma$ denote 
the dephased version of $\rho$: 

$$ \sigma \equiv \Delta_Y(\rho) = \sum_y |y\rangle\langle y| \rho |y\rangle\langle y|. \tag{11.145} $$

Then the entropy $H(\sigma)$ of the completely dephased state is greater than the entropy $H(\rho)$ of 
the original state: 

$$ H(\sigma) \geq H(\rho). \tag{11.146} $$

Proof. Suppose that $\rho$ has the following spectral decomposition: 

$$ \rho = \sum_x p_X(x) |x\rangle\langle x|^X. \tag{11.147} $$

Then the completely dephased state $\sigma$ admits the following representation: 

\begin{align}
\sigma &= \sum_y |y\rangle\langle y| \rho |y\rangle\langle y| \tag{11.148} \\
&= \sum_y |y\rangle\langle y| \Big( \sum_x p_X(x) |x\rangle\langle x| \Big) |y\rangle\langle y| \tag{11.149} \\
&= \sum_{x,y} p_X(x) |\langle y|x\rangle|^2 |y\rangle\langle y| \tag{11.150} \\
&= \sum_{x,y} p_X(x) p_{Y|X}(y|x) |y\rangle\langle y|, \tag{11.151}
\end{align}

where we define $p_{Y|X}(y|x) \equiv |\langle y|x\rangle|^2$. In particular, the eigenvalues of $\sigma$ are 
$p_Y(y) = \sum_x p_X(x) p_{Y|X}(y|x)$ because $\sigma$ is diagonal in the dephasing basis $\{|y\rangle\}$. We 
can then exploit positivity of relative entropy to obtain the following chain of inequalities: 

\begin{align}
0 &\leq D(\rho \| \sigma) \tag{11.152} \\
&= -H(\rho) - \mathrm{Tr}\{\rho \log \sigma\} \tag{11.153} \\
&= -H(\rho) - \mathrm{Tr}\Big\{ \sum_x p_X(x) |x\rangle\langle x| \log\Big( \sum_y \Big[ \sum_{x'} p_X(x') p_{Y|X}(y|x') \Big] |y\rangle\langle y| \Big) \Big\} \tag{11.154} \\
&= -H(\rho) - \mathrm{Tr}\Big\{ \sum_{x,y} p_X(x) |x\rangle\langle x| \log(p_Y(y)) |y\rangle\langle y| \Big\}. \tag{11.155}
\end{align}

The first inequality follows from positivity of quantum relative entropy. The first equality 
follows from the definition of quantum relative entropy. The second equality follows by 
expanding the term $-\mathrm{Tr}\{\rho \log \sigma\}$. The third equality follows by evaluating the logarithm on 
the eigenvalues in a spectral decomposition. Continuing, 

\begin{align}
&= -H(\rho) - \sum_{x,y} p_X(x) |\langle y|x\rangle|^2 \log(p_Y(y)) \tag{11.156} \\
&= -H(\rho) - \sum_y p_Y(y) \log(p_Y(y)) \tag{11.157} \\
&= -H(\rho) + H(\sigma). \tag{11.158}
\end{align}

The first equality follows from linearity of trace. The second equality follows from the 
definition of $p_Y(y)$. The final equality follows because the eigenvalues of $\sigma$ are $p_Y(y)$. □ 
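
A small numerical illustration of Corollary 11.9.3, assuming NumPy and taking the computational basis as the dephasing basis (an assumption made here for simplicity; the helper names are illustrative):

```python
import numpy as np

def entropy(rho):
    """Von Neumann entropy H(rho) in bits."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log2(evals)))

def random_state(dim, rng):
    """Random full-rank density operator (an arbitrary sampling choice)."""
    g = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

def dephase(rho):
    """Complete dephasing in the computational basis: keep only the diagonal."""
    return np.diag(np.diag(rho).real)

rng = np.random.default_rng(3)
rho = random_state(3, rng)
sigma = dephase(rho)

h_rho, h_sigma = entropy(rho), entropy(sigma)
print("H(rho)   =", h_rho)
print("H(sigma) =", h_sigma)   # never smaller than H(rho)
```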

The quantum relative entropy itself is not equivalent to a distance measure, but it 
gives a useful upper bound on the trace distance between two quantum states. In this 
sense, we can think of it as being nearly equivalent to a distance measure: if the quantum 
relative entropy between two quantum states is small, then their trace distance is small as 
well. 
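Theorem 11.9.2 (Quantum Pinsker Inequality). The quantum relative entropy $D(\rho \| \sigma)$ is 
an upper bound on the trace distance $\|\rho - \sigma\|_1$: 

$$ \frac{1}{2 \ln 2} \left( \|\rho - \sigma\|_1 \right)^2 \leq D(\rho \| \sigma). \tag{11.159} $$

Proof. We first prove the inequality for qubit states $\rho$ and $\sigma$ that are diagonal in the same 
basis: 

\begin{align}
\rho &= p|0\rangle\langle 0| + (1-p)|1\rangle\langle 1|, \tag{11.160} \\
\sigma &= q|0\rangle\langle 0| + (1-q)|1\rangle\langle 1|, \tag{11.161}
\end{align}

where $p \geq q$. This corresponds to demonstrating the following inequality: 

$$ p \log\Big(\frac{p}{q}\Big) + (1-p) \log\Big(\frac{1-p}{1-q}\Big) \geq \frac{4}{2 \ln 2} (p-q)^2. \tag{11.162} $$

Consider the function $g(p,q)$ where 

$$ g(p,q) \equiv p \log\Big(\frac{p}{q}\Big) + (1-p) \log\Big(\frac{1-p}{1-q}\Big) - \frac{4}{2 \ln 2} (p-q)^2, \tag{11.163} $$

so that $g(p,q)$ corresponds to the difference of the LHS and the RHS in (11.162). Then 

\begin{align}
\frac{\partial g(p,q)}{\partial q} &= -\frac{p}{q \ln 2} + \frac{1-p}{(1-q) \ln 2} - \frac{4}{\ln 2}(q-p) \tag{11.164} \\
&= -\frac{p(1-q)}{q(1-q) \ln 2} + \frac{q(1-p)}{q(1-q) \ln 2} - \frac{4}{\ln 2}(q-p) \tag{11.165} \\
&= \frac{q-p}{q(1-q) \ln 2} - \frac{4}{\ln 2}(q-p) \tag{11.166} \\
&= \frac{(q-p)(4q^2 - 4q + 1)}{q(1-q) \ln 2} \tag{11.167} \\
&= \frac{(q-p)(2q-1)^2}{q(1-q) \ln 2} \tag{11.168} \\
&\leq 0, \tag{11.169}
\end{align}

with the last step holding from the assumption that $p \geq q$ and $1 \geq q \geq 0$. Also, observe that 
both $\partial g(p,q)/\partial q = 0$ and $g(p,q) = 0$ when $p = q$. Thus, the function $g(p,q)$ is decreasing in 
$q$ for every $p$ whenever $q \leq p$ and reaches a minimum when $q = p$. So $g(p,q) \geq 0$ whenever 
$p \geq q$. The theorem also holds in general by applying it to $p' \equiv 1-p$ and $q' \equiv 1-q$ so that 
$q' \geq p'$. Now we prove the "fully quantum" version of the theorem for arbitrary states $\rho$ and 
$\sigma$ by exploiting the above result. Consider the projector $\Pi$ onto the positive eigenspace of 
$\rho - \sigma$, so that $I - \Pi$ is the projector onto the negative eigenspace (recall from Lemma 9.1.1 
that $2\mathrm{Tr}\{\Pi(\rho - \sigma)\} = \|\rho - \sigma\|_1$). Let $\mathcal{M}$ be a quantum operation that performs this 
projective measurement, so that 

\begin{align}
\mathcal{M}(\rho) &= \mathrm{Tr}\{\Pi\rho\} |0\rangle\langle 0| + \mathrm{Tr}\{(I-\Pi)\rho\} |1\rangle\langle 1|, \tag{11.170} \\
\mathcal{M}(\sigma) &= \mathrm{Tr}\{\Pi\sigma\} |0\rangle\langle 0| + \mathrm{Tr}\{(I-\Pi)\sigma\} |1\rangle\langle 1|. \tag{11.171}
\end{align}

Let $p = \mathrm{Tr}\{\Pi\rho\}$ and $q = \mathrm{Tr}\{\Pi\sigma\}$. Applying monotonicity of quantum relative entropy 
(Theorem 11.9.1) and the proof for binary variables above gives that 

\begin{align}
D(\rho \| \sigma) &\geq D(\mathcal{M}(\rho) \| \mathcal{M}(\sigma)) \tag{11.172} \\
&\geq \frac{4}{2 \ln 2} (p-q)^2 \tag{11.173} \\
&= \frac{4}{2 \ln 2} (\mathrm{Tr}\{\Pi\rho\} - \mathrm{Tr}\{\Pi\sigma\})^2 \tag{11.174} \\
&= \frac{1}{2 \ln 2} (2\mathrm{Tr}\{\Pi(\rho - \sigma)\})^2 \tag{11.175} \\
&= \frac{1}{2 \ln 2} \|\rho - \sigma\|_1^2. \tag{11.176}
\end{align}

□ 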
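
The inequality (11.159) can be checked on random examples. The sketch below assumes NumPy and base-two logarithms in the relative entropy (which is where the factor $1/\ln 2$ comes from); the helper names and the choice of qutrit states are illustrative.

```python
import numpy as np

def random_state(dim, rng):
    """Random full-rank density operator (an arbitrary sampling choice)."""
    g = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

def relative_entropy(rho, sigma, tol=1e-12):
    """D(rho||sigma) in bits."""
    p = np.linalg.eigvalsh(rho)
    p = p[p > tol]
    q, psi = np.linalg.eigh(sigma)
    weights = np.real(np.einsum('iy,ij,jy->y', psi.conj(), rho, psi))
    if np.any((q <= tol) & (weights > tol)):
        return float('inf')
    keep = q > tol
    return float(np.sum(p * np.log2(p))
                 - np.sum(weights[keep] * np.log2(q[keep])))

def trace_norm(m):
    """Trace norm of a Hermitian matrix: sum of absolute eigenvalues."""
    return float(np.sum(np.abs(np.linalg.eigvalsh(m))))

rng = np.random.default_rng(4)
gaps = []
for _ in range(20):
    rho, sigma = random_state(3, rng), random_state(3, rng)
    lower = trace_norm(rho - sigma) ** 2 / (2 * np.log(2))
    gaps.append(relative_entropy(rho, sigma) - lower)

min_gap = min(gaps)
print("smallest value of D(rho||sigma) - ||rho - sigma||_1^2 / (2 ln 2):", min_gap)
```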

11.9.3 The Quantum Data Processing Inequality 

The quantum data processing inequality is similar in spirit to the classical data processing 
inequality. Recall that the classical data processing inequality states that processing classical 
data reduces classical correlations. The quantum data processing inequality states that 
processing quantum data reduces quantum correlations. 

It applies to the following scenario. Suppose that Alice and Bob share some pure bipartite 
state $|\phi\rangle^{AB}$. The coherent information $I(A\rangle B)_\phi$ quantifies the quantum correlations 
present in this state. Bob then processes his system $B$ according to some CPTP map 
$\mathcal{N}_1^{B \to B_1}$ to produce some quantum system $B_1$ and let $\rho^{AB_1}$ denote the resulting state 
(in general, it could be a mixed state). He further processes his system $B_1$ according to some 
CPTP map $\mathcal{N}_2^{B_1 \to B_2}$ to produce some quantum system $B_2$ and let $\sigma^{AB_2}$ denote the state 
resulting from the second map. The quantum data processing inequality states that each step 
of quantum data processing reduces quantum correlations, in the sense that 

$$ I(A\rangle B)_\phi \geq I(A\rangle B_1)_\rho \geq I(A\rangle B_2)_\sigma. \tag{11.177} $$



Figure 11.1(a) depicts the scenario described above corresponding to the quantum data 
processing inequality. Figure 11.1(b) depicts this same scenario with the isometric extensions 
of the respective maps $\mathcal{N}_1^{B \to B_1}$ and $\mathcal{N}_2^{B_1 \to B_2}$; this latter depiction is useful in the proof 
of the quantum data processing inequality. 

A condition similar to the Markov condition holds for the quantum case. Each of the 
maps $\mathcal{N}_1^{B \to B_1}$ and $\mathcal{N}_2^{B_1 \to B_2}$ acts only on one of Bob's systems; it does not act in any way 
on Alice's system. This behavior is what allows us to prove the quantum data processing 
inequality. 
Figure 11.1: Two slightly different depictions of the quantum data processing inequality. (a) Alice and 
Bob begin by sharing some pure state $|\phi\rangle^{AB}$. Bob processes his system $B$ with the CPTP map 
$\mathcal{N}_1$ and further processes $B_1$ with the CPTP map $\mathcal{N}_2$. Quantum correlations can only decrease 
after quantum data processing, in the sense that $I(A\rangle B) \geq I(A\rangle B_1) \geq I(A\rangle B_2)$. It also 
follows that $I(A;B) \geq I(A;B_1) \geq I(A;B_2)$. (b) The depiction on the right is the same as that on 
the left except we consider the respective isometric extensions $U_{\mathcal{N}_1}$ and $U_{\mathcal{N}_2}$ of the channels 
$\mathcal{N}_1$ and $\mathcal{N}_2$. The quantum state after $U_{\mathcal{N}_1}$ is a pure state shared among the systems $A$, 
$B_1$, and $E_1$, and the state after $U_{\mathcal{N}_2}$ is a pure state shared among the systems $A$, $B_2$, $E_2$, 
and $E_1$. 



Theorem 11.9.3 (Quantum Data Processing Inequality for Coherent Information). Suppose 
that $\rho^{AB_1} \equiv \mathcal{N}_1^{B \to B_1}(\phi^{AB})$ and $\sigma^{AB_2} \equiv \mathcal{N}_2^{B_1 \to B_2}(\rho^{AB_1})$. Then the following quantum 
data processing inequality applies for coherent information: 

$$ I(A\rangle B)_\phi \geq I(A\rangle B_1)_\rho \geq I(A\rangle B_2)_\sigma. \tag{11.178} $$

Proof. The proof exploits the depiction of quantum data processing in Figure 11.1(b), the 
equivalence of the marginal entropies of a pure bipartite state (Theorem 11.2.1), and strong 
subadditivity (Theorem 11.7.1). First consider that 

\begin{align}
I(A\rangle B)_\phi &= H(B)_\phi - H(AB)_\phi \tag{11.179} \\
&= H(B)_\phi \tag{11.180} \\
&= H(A)_\phi. \tag{11.181}
\end{align}

The first equality follows by definition, the second equality follows because the entropy of 
a pure state vanishes, and the third equality follows from Theorem 11.2.1. Let $|\psi\rangle^{AB_1E_1}$ be 
the output of the isometry $U_{\mathcal{N}_1}^{B \to B_1E_1}$: 

$$ |\psi\rangle^{AB_1E_1} \equiv U_{\mathcal{N}_1}^{B \to B_1E_1} |\phi\rangle^{AB}. \tag{11.182} $$

(It is also a purification of the state $\rho^{AB_1}$.) Consider that 

\begin{align}
I(A\rangle B_1)_\rho &= I(A\rangle B_1)_\psi \tag{11.183} \\
&= H(B_1)_\psi - H(AB_1)_\psi \tag{11.184} \\
&= H(AE_1)_\psi - H(E_1)_\psi. \tag{11.185}
\end{align}

Note that $H(A)_\phi = H(A)_\psi$ because no processing occurs on system $A$. Recall from 
Theorem 11.6.1 that quantum mutual information is always positive. Then the following chain 
of inequalities proves the first inequality in Theorem 11.9.3: 

\begin{align}
I(A;E_1)_\psi &\geq 0 \tag{11.186} \\
\therefore \ H(A)_\psi &\geq H(AE_1)_\psi - H(E_1)_\psi \tag{11.187} \\
\therefore \ I(A\rangle B)_\phi &\geq I(A\rangle B_1)_\rho. \tag{11.188}
\end{align}

The first line applies positivity of quantum mutual information. The second line applies the 
definition of quantum mutual information, and the third line applies the results in (11.181) 
and (11.185). We prove the other inequality by a similar line of reasoning (though this 
time we resort to the positivity of conditional mutual information in Theorem 11.7.1). Let 
$|\omega\rangle^{AB_2E_2E_1}$ be the output of the isometry $U_{\mathcal{N}_2}^{B_1 \to B_2E_2}$: 

$$ |\omega\rangle^{AB_2E_2E_1} \equiv U_{\mathcal{N}_2}^{B_1 \to B_2E_2} |\psi\rangle^{AB_1E_1}. \tag{11.189} $$

(It is also a purification of the state $\sigma^{AB_2}$.) Then 

\begin{align}
I(A\rangle B_2)_\sigma &= I(A\rangle B_2)_\omega \tag{11.190} \\
&= H(B_2)_\omega - H(AB_2)_\omega \tag{11.191} \\
&= H(AE_1E_2)_\omega - H(E_1E_2)_\omega \tag{11.192} \\
&= H(AE_2|E_1)_\omega - H(E_2|E_1)_\omega. \tag{11.193}
\end{align}

The third equality follows from Theorem 11.2.1 and the last follows by adding and 
subtracting the marginal entropy $H(E_1)_\omega$ and recalling the definition of conditional quantum 
entropy. Also, 

\begin{align}
I(A\rangle B_1)_\rho &= H(AE_1)_\psi - H(E_1)_\psi \tag{11.194} \\
&= H(AE_1)_\omega - H(E_1)_\omega \tag{11.195} \\
&= H(A|E_1)_\omega. \tag{11.196}
\end{align}

The first equality follows from (11.185). The second equality follows because there is no 
quantum processing on systems $A$ and $E_1$ to produce state $\omega$ from state $\psi$. The third equality 
follows from the definition of conditional quantum entropy. Recall from Theorem 11.7.1 that 
conditional quantum mutual information is always positive. Then the following chain of 
inequalities proves the second inequality in Theorem 11.9.3: 

\begin{align}
I(A;E_2|E_1)_\omega &\geq 0 \tag{11.197} \\
\therefore \ H(A|E_1)_\omega &\geq H(AE_2|E_1)_\omega - H(E_2|E_1)_\omega \tag{11.198} \\
\therefore \ I(A\rangle B_1)_\rho &\geq I(A\rangle B_2)_\sigma. \tag{11.199}
\end{align}

The first line applies positivity of conditional quantum mutual information. The second line 
applies the definition of conditional quantum mutual information, and the third line applies 
the results in (11.193) and (11.196). □ 
Corollary 11.9.4 (Quantum Data Processing Inequality for Quantum Mutual Information). 
Suppose that $\rho^{AB_1} \equiv \mathcal{N}_1^{B \to B_1}(\phi^{AB})$ and $\sigma^{AB_2} \equiv \mathcal{N}_2^{B_1 \to B_2}(\rho^{AB_1})$. Then the following 
quantum data processing inequality applies to the quantum mutual information: 

$$ I(A;B)_\phi \geq I(A;B_1)_\rho \geq I(A;B_2)_\sigma. \tag{11.200} $$

By symmetry, we also have 

$$ I(A;B) \geq I(A_1;B) \geq I(A_2;B), \tag{11.201} $$

for some maps applied to the $A$ system in the order $A \to A_1 \to A_2$. 

Proof. The proof follows because the quantum mutual information $I(A;B) = H(A) + I(A\rangle B)$, 
the entropy $H(A)$ is unchanged by maps that act only on Bob's systems, and we can apply the 
quantum data processing inequality for coherent information. The quantum data processing 
inequality is symmetric for the case of quantum mutual information (QMI) because the QMI 
itself is symmetric. □ 
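
The two data processing inequalities (11.178) and (11.200) can be illustrated with a concrete pair of channels. The sketch below assumes NumPy, a random pure two-qubit state, and takes both of Bob's maps to be qubit depolarizing channels with arbitrary parameters 0.2 and 0.5 (so $B$, $B_1$, and $B_2$ are all qubits); these choices and the helper names are assumptions made for the example.

```python
import numpy as np

def entropy(rho):
    """Von Neumann entropy H(rho) in bits."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log2(evals)))

def marginals(rho_ab):
    """Reduced states (rho_A, rho_B) of a two-qubit state."""
    t = rho_ab.reshape(2, 2, 2, 2)
    return np.einsum('ibjb->ij', t), np.einsum('aiaj->ij', t)

def depolarize_b(rho_ab, p):
    """Depolarizing channel on B only: (1-p) rho^AB + p rho^A (x) I/2."""
    rho_a, _ = marginals(rho_ab)
    return (1 - p) * rho_ab + p * np.kron(rho_a, np.eye(2) / 2)

def coherent_info(rho_ab):
    """I(A>B) = H(B) - H(AB)."""
    _, rho_b = marginals(rho_ab)
    return entropy(rho_b) - entropy(rho_ab)

def mutual_info(rho_ab):
    """I(A;B) = H(A) + H(B) - H(AB)."""
    rho_a, rho_b = marginals(rho_ab)
    return entropy(rho_a) + entropy(rho_b) - entropy(rho_ab)

rng = np.random.default_rng(5)
v = rng.normal(size=4) + 1j * rng.normal(size=4)
v /= np.linalg.norm(v)
phi = np.outer(v, v.conj())            # pure state shared by Alice and Bob

rho_1 = depolarize_b(phi, 0.2)         # after Bob's first map
sigma_2 = depolarize_b(rho_1, 0.5)     # after Bob's second map

coh = [coherent_info(s) for s in (phi, rho_1, sigma_2)]
mut = [mutual_info(s) for s in (phi, rho_1, sigma_2)]
print("coherent information:", coh)    # non-increasing along the chain
print("mutual information:  ", mut)    # non-increasing along the chain
```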

Exercise 11.9.1 (Holevo Bound) Use the quantum data processing inequality to show 
that the Holevo information $\chi(\mathcal{E})$ is an upper bound on the accessible information 
$I_{\mathrm{acc}}(\mathcal{E})$: 

$$ I_{\mathrm{acc}}(\mathcal{E}) \leq \chi(\mathcal{E}). \tag{11.202} $$

Exercise 11.9.2 (Shannon Entropy versus von Neumann Entropy of an Ensemble) 
Consider an ensemble $\{p_X(x), |\psi_x\rangle\}$. The expected density operator of the ensemble is 

$$ \rho \equiv \sum_x p_X(x) |\psi_x\rangle\langle\psi_x|. \tag{11.203} $$

Use the quantum data processing inequality to show that the Shannon entropy $H(X)$ is 
never less than the von Neumann entropy of the expected density operator $\rho$: 

$$ H(X) \geq H(\rho). \tag{11.204} $$

(Hint: Begin with a classical common randomness state $\sum_x p_X(x) |x\rangle\langle x|^X \otimes |x\rangle\langle x|^{X'}$ 
and apply a preparation map to system $X'$.) Conclude that the Shannon entropy of the 
ensemble is strictly greater than the von Neumann entropy whenever the states in the 
ensemble are non-orthogonal. 

Exercise 11.9.3 Use the idea in the above exercise to show that the conditional entropy 
$H(X|B)$ is always non-negative whenever the state $\rho^{XB}$ is a classical-quantum state: 

$$ \rho^{XB} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \rho_x^B. \tag{11.205} $$

Exercise 11.9.4 (Separability and Negativity of Coherent Information) Show that 
the following inequality holds for any separable state $\rho^{AB}$: 

$$ \max\{ I(A\rangle B)_{\rho^{AB}}, I(B\rangle A)_{\rho^{AB}} \} \leq 0. \tag{11.206} $$
11.9.4 Continuity of Quantum Entropy 

Suppose that two density operators $\rho$ and $\sigma$ are close in trace distance. We might then 
expect several properties to hold: the fidelity between them should be close to one and their 
entropies should be close. Theorem 9.3.1 states that the fidelity is close to one if the trace 
distance is small. 

An important theorem below, the Alicki-Fannes' inequality, states that conditional quantum 
entropies are close as well. This theorem usually finds application in a proof of a converse 
theorem in quantum Shannon theory. Usually, the specification of any good protocol 
(in the sense of asymptotically vanishing error) involves placing a bound on the trace 
distance between the actual state resulting from a protocol and the ideal state that it should 
produce. The Alicki-Fannes' inequality then allows us to translate these statements of error 
into informational statements that bound the asymptotic rates of communication in any 
good protocol. We give the full proof of this theorem below, and ask you to prove variations 
of it in the exercises. 

Theorem 11.9.4 (Alicki-Fannes Inequality). For any states $\rho^{AB}$ and $\sigma^{AB}$ with 
$\|\rho^{AB} - \sigma^{AB}\|_1 \leq \epsilon$, 

$$ |H(A|B)_\rho - H(A|B)_\sigma| \leq 4\epsilon \log d_A + 2H_2(\epsilon), \tag{11.207} $$

where $H_2(\epsilon)$ is the binary entropy function. 

Proof. Suppose that $\|\rho^{AB} - \sigma^{AB}\|_1 = \epsilon$ and $\epsilon < 1$ (we are really only concerned with small 
$\epsilon$). Let $\tilde{\rho}^{AB}$ and $\tilde{\sigma}^{AB}$ denote the following density operators: 

\begin{align}
\tilde{\rho}^{AB} &\equiv \frac{1}{\epsilon} |\rho^{AB} - \sigma^{AB}|, \tag{11.208} \\
\tilde{\sigma}^{AB} &\equiv \frac{1-\epsilon}{\epsilon} (\rho^{AB} - \sigma^{AB}) + \tilde{\rho}^{AB}. \tag{11.209}
\end{align}

(You should verify that both are indeed valid density operators!) We introduce a 
classical-quantum state $\gamma^{XAB}$: 

$$ \gamma^{XAB} \equiv (1-\epsilon) |0\rangle\langle 0|^X \otimes \rho^{AB} + \epsilon |1\rangle\langle 1|^X \otimes \tilde{\rho}^{AB}. \tag{11.210} $$

Consider that 

$$ \gamma^{AB} = \mathrm{Tr}_X\{\gamma^{XAB}\} = (1-\epsilon)\rho^{AB} + \epsilon\tilde{\rho}^{AB}. \tag{11.211} $$

The following crucial equivalence holds as well: 

$$ \gamma^{AB} = (1-\epsilon)\sigma^{AB} + \epsilon\tilde{\sigma}^{AB}, \tag{11.212} $$

by examining the definition of $\tilde{\sigma}^{AB}$ in (11.209). Thus, we can mix the states $\rho^{AB}$ and 
$\tilde{\rho}^{AB}$ and the states $\sigma^{AB}$ and $\tilde{\sigma}^{AB}$ in the same proportions to get the state $\gamma^{AB}$. We now 
prove the following inequality: 

$$ |H(A|B)_\rho - H(A|B)_\gamma| \leq 2\epsilon \log d_A + H_2(\epsilon). \tag{11.213} $$

We first prove that $H(A|B)_\rho - H(A|B)_\gamma \leq 2\epsilon \log d_A + H_2(\epsilon)$. The following inequality holds 
because the conditional entropy is concave (Exercise 11.7.4): 

$$ H(A|B)_\gamma \geq (1-\epsilon) H(A|B)_\rho + \epsilon H(A|B)_{\tilde{\rho}}. \tag{11.214} $$

The above inequality implies the following one: 

\begin{align}
H(A|B)_\rho - H(A|B)_\gamma &\leq \epsilon (H(A|B)_\rho - H(A|B)_{\tilde{\rho}}) \tag{11.215} \\
&\leq 2\epsilon \log d_A \tag{11.216} \\
&\leq 2\epsilon \log d_A + H_2(\epsilon). \tag{11.217}
\end{align}

The second inequality holds because $\log d_A$ is the largest that each of $|H(A|B)_\rho|$ and 
$|H(A|B)_{\tilde{\rho}}|$ can be (Theorem 11.5.1). We now prove the other bound in the absolute value 
in (11.213): $H(A|B)_\gamma - H(A|B)_\rho \leq 2\epsilon \log d_A + H_2(\epsilon)$. Concavity of quantum entropy 
(Exercise 11.6.9) implies the following inequality: 

$$ H(B)_\gamma \geq (1-\epsilon) H(B)_\rho + \epsilon H(B)_{\tilde{\rho}}. \tag{11.218} $$

Consider that the following inequality holds 

$$ H(AB)_\gamma \leq H(ABX)_\gamma, \tag{11.219} $$

because the addition of a classical system can only increase the entropy. Then 

\begin{align}
H(AB)_\gamma &\leq H(AB|X)_\gamma + H(X)_\gamma \tag{11.220} \\
&= (1-\epsilon) H(AB)_\rho + \epsilon H(AB)_{\tilde{\rho}} + H_2(\epsilon). \tag{11.221}
\end{align}

Combining (11.218) and (11.221) gives the following inequality: 

\begin{align}
H(A|B)_\gamma &\leq (1-\epsilon) H(A|B)_\rho + \epsilon H(A|B)_{\tilde{\rho}} + H_2(\epsilon) \tag{11.222} \\
\Rightarrow \ H(A|B)_\gamma - H(A|B)_\rho &\leq \epsilon (H(A|B)_{\tilde{\rho}} - H(A|B)_\rho) + H_2(\epsilon) \tag{11.223} \\
&\leq 2\epsilon \log d_A + H_2(\epsilon). \tag{11.224}
\end{align}

By the same line of reasoning, the following inequality holds 

$$ |H(A|B)_\sigma - H(A|B)_\gamma| \leq 2\epsilon \log d_A + H_2(\epsilon). \tag{11.225} $$

We can now complete the proof of Theorem 11.9.4. Consider the following chain of 
inequalities: 

\begin{align}
|H(A|B)_\rho - H(A|B)_\sigma| &\leq |H(A|B)_\rho - H(A|B)_\gamma| + |H(A|B)_\sigma - H(A|B)_\gamma| \tag{11.226} \\
&\leq 4\epsilon \log d_A + 2H_2(\epsilon). \tag{11.227}
\end{align}

The first inequality follows from the triangle inequality and the second follows from (11.213) 
and (11.225). □ 
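
To get a feel for the bound in (11.207), one can compare both sides on a pair of nearby two-qubit states. The sketch below assumes NumPy, base-two logarithms, and $d_A = 2$; the construction of $\sigma$ as a small perturbation of $\rho$ (mixing weight 0.05) and all helper names are illustrative assumptions.

```python
import numpy as np

def entropy(rho):
    """Von Neumann entropy H(rho) in bits."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log2(evals)))

def random_state(dim, rng):
    """Random full-rank density operator (an arbitrary sampling choice)."""
    g = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

def trace_norm(m):
    """Trace norm of a Hermitian matrix."""
    return float(np.sum(np.abs(np.linalg.eigvalsh(m))))

def binary_entropy(t):
    """H_2(t) in bits, with H_2(0) = H_2(1) = 0."""
    if t <= 0.0 or t >= 1.0:
        return 0.0
    return float(-t * np.log2(t) - (1 - t) * np.log2(1 - t))

def cond_entropy(rho_ab):
    """H(A|B) = H(AB) - H(B) for a two-qubit state."""
    rho_b = np.einsum('aiaj->ij', rho_ab.reshape(2, 2, 2, 2))
    return entropy(rho_ab) - entropy(rho_b)

rng = np.random.default_rng(7)
d_a = 2
rho = random_state(4, rng)
tau = random_state(4, rng)
sigma = 0.95 * rho + 0.05 * tau        # a state close to rho

eps = trace_norm(rho - sigma)          # here eps is at most 0.1
bound = 4 * eps * np.log2(d_a) + 2 * binary_entropy(eps)
diff = abs(cond_entropy(rho) - cond_entropy(sigma))

print("||rho - sigma||_1           =", eps)
print("|H(A|B)_rho - H(A|B)_sigma| =", diff)
print("Alicki-Fannes bound         =", bound)
```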
A corollary of the above theorem is Fannes' inequality, which provides a better upper 
bound on the entropy difference between two states $\rho$ and $\sigma$. 

Theorem 11.9.5 (Fannes' Inequality). For any $\rho$ and $\sigma$ with $\|\rho - \sigma\|_1 \leq \epsilon$, the following 
inequality holds: 

$$ |H(\rho) - H(\sigma)| \leq 2\epsilon \log d + 2H_2(\epsilon). \tag{11.228} $$

Exercise 11.9.5 Prove Fannes' inequality. (Hint: You can exploit the proof of the 
Alicki-Fannes' inequality.) 

A slight (and optimal) improvement of the above is due to Audenaert and is known as 
the Fannes-Audenaert inequality: 

Theorem 11.9.6 (Fannes-Audenaert Inequality). For any $\rho$ and $\sigma$ with 
$T \equiv \frac{1}{2}\|\rho - \sigma\|_1$, the following inequality holds: 

$$ |H(\rho) - H(\sigma)| \leq T \log(d-1) + H_2(T). \tag{11.229} $$
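
A quick numerical look at (11.229), assuming NumPy, base-two logarithms, and dimension $d = 4$; the helper names and the random sampling are illustrative choices. The quantity `worst_slack` is the smallest observed difference between the right-hand and left-hand sides.

```python
import numpy as np

def entropy(rho):
    """Von Neumann entropy H(rho) in bits."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log2(evals)))

def random_state(dim, rng):
    """Random full-rank density operator (an arbitrary sampling choice)."""
    g = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = g @ g.conj().T
    return rho / np.trace(rho).real

def trace_norm(m):
    """Trace norm of a Hermitian matrix."""
    return float(np.sum(np.abs(np.linalg.eigvalsh(m))))

def binary_entropy(t):
    """H_2(t) in bits, with H_2(0) = H_2(1) = 0."""
    if t <= 0.0 or t >= 1.0:
        return 0.0
    return float(-t * np.log2(t) - (1 - t) * np.log2(1 - t))

rng = np.random.default_rng(6)
d = 4
worst_slack = float('inf')
for _ in range(20):
    rho, sigma = random_state(d, rng), random_state(d, rng)
    T = 0.5 * trace_norm(rho - sigma)
    bound = T * np.log2(d - 1) + binary_entropy(T)
    slack = bound - abs(entropy(rho) - entropy(sigma))
    worst_slack = min(worst_slack, slack)

print("smallest slack in the Fannes-Audenaert bound:", worst_slack)
```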

Exercise 11.9.6 (Alicki-Fannes' inequality for Coherent Information) Prove that 

$$ |I(A\rangle B)_\rho - I(A\rangle B)_\sigma| \leq 4\epsilon \log d_A + 2H_2(\epsilon), \tag{11.230} $$

for any $\rho^{AB}$ and $\sigma^{AB}$ with $\|\rho^{AB} - \sigma^{AB}\|_1 \leq \epsilon$. 

Exercise 11.9.7 (Alicki-Fannes' inequality for Quantum Mutual Information) 
Prove that 

$$ |I(A;B)_\rho - I(A;B)_\sigma| \leq 6\epsilon \log d_A + 4H_2(\epsilon), \tag{11.231} $$

for any $\rho^{AB}$ and $\sigma^{AB}$ with $\|\rho^{AB} - \sigma^{AB}\|_1 \leq \epsilon$. 

11.9.5 The Uncertainty Principle in the Presence of Quantum Memory 



The uncertainty principle reviewed in Section 3.4.2 aims to capture a fundamental feature of 
quantum mechanics, namely, that there is an unavoidable uncertainty in the measurement 
outcomes of incompatible (non-commuting) observables. This uncertainty principle is a 
radical departure from classical intuitions, where, in principle, it seems as if there should not 
be any obstacle to measuring incompatible observables such as position and momentum. 

However, the uncertainty principle that we reviewed before (the standard version in most 
textbooks) suffers from a few deficiencies. First, the measure of uncertainty used there is 
the standard deviation, which is not just a function of the probabilities of measurement 
outcomes but also of the values of the outcomes. Thus, the values of the outcomes may 
skew the uncertainty measure (though one could always relabel the values in order to avoid 
this difficulty). More importantly, from an information-theoretic perspective, there 
is not a clear operational interpretation for the standard deviation as there is for entropy. 
Second, the lower bound in (3.111) depends not only on the observables but also on the state. 
In Exercise 3.4.5, we saw how this lower bound can vanish for a state even when the 
distributions corresponding to the measurement outcomes in fact do have uncertainty. So, it 
would be ideal to separate this lower bound into two terms: one which depends only on 
measurement incompatibility and another which depends only on the state. 

Additionally, it might seem as if giving two parties access to a maximally entangled 
state allows them to defy the uncertainty principle (and this is what confounded Einstein, 
Podolsky, and Rosen after quantum mechanics had been established). Indeed, suppose that 
Alice and Bob share a Bell state $|\Phi^+\rangle = 2^{-1/2}(|00\rangle + |11\rangle) = 2^{-1/2}(|{+}{+}\rangle + |{-}{-}\rangle)$. If Alice 
measures the Pauli $Z$ observable on her system, then Bob can guess the outcome of her 
measurement with certainty. Also, if Alice were instead to measure the Pauli $X$ observable 
on her system, then Bob would also be able to guess the outcome of her measurement with 
certainty, in spite of the fact that $Z$ and $X$ are incompatible observables. So, a revision 
of the uncertainty principle is clearly needed to account for this possibility, in the scenario 
where Bob shares a quantum memory correlated with Alice's system. 

The uncertainty principle in the presence of quantum memory is such a revision that 
meets all of the desiderata stated above. It quantifies uncertainty in terms of von Neumann 
entropy rather than with standard deviation, and it also accounts for the scenario in which 
an observer has a quantum memory correlated with the system being measured. So, suppose 
that Alice and Bob share systems A and B, respectively, that are in some state p AB . If Alice 
performs a POVM {A^} on her system A, then the post-measurement state is as follows: 

a XB = 5>}<:r| X ® Tr A {(A^ <g> I B )p AB ). (11.232) 

X 

In the above classical-quantum state, the measurement outcomes $x$ are encoded into
orthonormal states $\{|x\rangle\}$ of the classical register $X$, and the probability for obtaining outcome
$x$ is $\mathrm{Tr}\{(\Lambda_x^A \otimes I^B)\rho^{AB}\}$. We would like to quantify the uncertainty that Bob has about
the outcome of the measurement, and a natural quantity for doing so is the conditional
quantum entropy $H(X|B)_\sigma$. Similarly, starting from the state $\rho^{AB}$, Alice could choose to
measure some other POVM $\{\Gamma_z\}$ on her system $A$. In this case, the post-measurement state
is as follows:

$$\tau^{ZB} \equiv \sum_z |z\rangle\langle z|^Z \otimes \mathrm{Tr}_A\{(\Gamma_z^A \otimes I^B)\rho^{AB}\}, \tag{11.233}$$

with a similar interpretation as before. We could also quantify Bob's uncertainty about
the measurement outcome $z$ in terms of the conditional quantum entropy $H(Z|B)_\tau$. We
define Bob's total uncertainty about the measurements to be the sum of both entropies:
$H(X|B)_\sigma + H(Z|B)_\tau$. We will call this the uncertainty sum, in analogy with the uncertainty
product in (3.111).

We stated above that it would be desirable to have a lower bound on the uncertainty
sum consisting of a measurement incompatibility term and a state-dependent term. One way
to quantify the incompatibility for the POVMs $\{\Lambda_x\}$ and $\{\Gamma_z\}$ is in terms of the following
quantity:

$$c \equiv \max_{x,z} \left\| \sqrt{\Lambda_x}\sqrt{\Gamma_z} \right\|_\infty^2, \tag{11.234}$$

where $\|\cdot\|_\infty$ is the infinity norm of an operator (for the finite-dimensional case, $\|A\|_\infty$ is
just the maximal eigenvalue of $|A|$). To gain intuition for this incompatibility measure,
suppose that $\{\Lambda_x\}$ and $\{\Gamma_z\}$ are actually von Neumann measurements with one common
element. In this case, it follows that $c = 1$, so that the measurements are regarded as
maximally compatible. On the other hand, if the measurements are of Pauli observables $X$
and $Z$, these are maximally incompatible for a two-dimensional Hilbert space and $c = 1/2$.
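The two cases just described can be checked with a short Python sketch. It assumes, purely for illustration, that each measurement is a rank-one von Neumann measurement specified by an orthonormal basis, in which case $\|\sqrt{\Lambda_x}\sqrt{\Gamma_z}\|_\infty^2$ reduces to the squared overlap $|\langle x|z\rangle|^2$. The function name `incompatibility` and the example bases are hypothetical choices and do not come from the text.

```python
import math

def incompatibility(basis_x, basis_z):
    """c = max_{x,z} |<x|z>|^2 for two orthonormal bases (lists of vectors).

    Assumption: rank-one projective measurements, so the operator norm of
    sqrt(Lambda_x) sqrt(Gamma_z) equals the overlap |<x|z>|.
    """
    def overlap(u, v):
        return sum(complex(a).conjugate() * b for a, b in zip(u, v))
    return max(abs(overlap(x, z)) ** 2 for x in basis_x for z in basis_z)

s = 1 / math.sqrt(2)
z_basis = [[1, 0], [0, 1]]       # eigenbasis of Pauli Z
x_basis = [[s, s], [s, -s]]      # eigenbasis of Pauli X

c_pauli = incompatibility(x_basis, z_basis)    # incompatible qubit measurements
c_shared = incompatibility(z_basis, z_basis)   # measurements sharing an element
print(c_pauli, math.log2(1 / c_pauli))
print(c_shared)
```

Up to floating-point rounding, the first case gives $c = 1/2$ (so that $\log_2(1/c) = 1$) and the second gives $c = 1$.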
We now state the uncertainty principle in the presence of quantum memory: 

Theorem 11.9.7 (Uncertainty Principle with Quantum Memory). Suppose that Alice and
Bob share a state $\rho^{AB}$ and that Alice performs either of the POVMs $\{\Lambda_x\}$ or $\{\Gamma_z\}$ on her
share of the state (with at least one of $\{\Lambda_x\}$ or $\{\Gamma_z\}$ being a rank-one POVM). Then Bob's
total uncertainty about the measurement outcomes has the following lower bound:

$$H(X|B)_\sigma + H(Z|B)_\tau \geq \log_2(1/c) + H(A|B)_\rho, \tag{11.235}$$

where the states $\sigma^{XB}$ and $\tau^{ZB}$ are defined in (11.232) and (11.233), respectively, and the
measurement incompatibility $c$ is defined in (11.234).



Interestingly, the lower bound given in the above theorem consists of both the
measurement incompatibility and the state-dependent term $H(A|B)_\rho$. As we know from
Exercise 11.9.4, when the conditional quantum entropy $H(A|B)_\rho$ becomes negative, this implies
that the state $\rho^{AB}$ is entangled (but not necessarily the converse). Thus, a negative
conditional entropy implies that the lower bound on the uncertainty sum can become lower
than $\log_2(1/c)$, and furthermore, that it might be possible to reduce Bob's total uncertainty
about the measurement outcomes down to zero. Indeed, this is the case for the example we
mentioned before with measurements of Pauli $X$ and $Z$ on the maximally entangled Bell
state. One can verify for this case that $\log_2(1/c) = 1$ and $H(A|B)_\rho = -1$, so that this is
consistent with the fact that $H(X|B)_\sigma + H(Z|B)_\tau = 0$ for this example. We now give a path
to proving the above theorem (leaving the final steps as an exercise).

Proof. We actually prove the following uncertainty relation instead:

$$H(X|B)_\sigma + H(Z|E)_\omega \geq \log_2(1/c), \tag{11.236}$$

where $\omega^{ZE}$ is a classical-quantum state of the following form:

$$\omega^{ZE} \equiv \sum_z |z\rangle\langle z|^Z \otimes \mathrm{Tr}_{AB}\{(\Gamma_z^A \otimes I^{BE})\phi_\rho^{ABE}\}, \tag{11.237}$$

and $\phi_\rho^{ABE}$ is a purification of $\rho^{AB}$. We leave it as an exercise to demonstrate that the above
uncertainty relation implies the one in the statement of the theorem whenever $\{\Gamma_z^A\}$ is a
rank-one POVM. Consider defining the following isometric extensions of the measurement maps
for $\{\Lambda_x\}$ and $\{\Gamma_z\}$:

$$U_\sigma^{A \to XX'A} \equiv \sum_x |x\rangle^X \otimes |x\rangle^{X'} \otimes \sqrt{\Lambda_x}, \tag{11.238}$$

$$U_\tau^{A \to ZZ'A} \equiv \sum_z |z\rangle^Z \otimes |z\rangle^{Z'} \otimes \sqrt{\Gamma_z}, \tag{11.239}$$


©2012 Mark M. Wilde — This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 
Unported License 






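where $\{|x\rangle\}$ and $\{|z\rangle\}$ are both orthonormal bases. Let $\omega^{ZZ'ABE}$ denote the following state:

$$|\omega\rangle^{ZZ'ABE} \equiv U_\tau^{A \to ZZ'A} |\phi_\rho\rangle^{ABE}, \tag{11.240}$$

so that $\omega^{ZE} = \mathrm{Tr}_{Z'AB}\{\omega^{ZZ'ABE}\}$. Now consider that

$$H(Z|E)_\omega = -H(Z|Z'AB)_\omega, \tag{11.241}$$

which holds because the state $\omega^{ZZ'ABE}$ is pure, so that (11.236) is equivalent to

$$-H(Z|Z'AB)_\omega \geq \log_2(1/c) - H(X|B)_\sigma. \tag{11.242}$$

Recalling the result of Exercise 11.8.3, we then have that the above is equivalent to

$$D(\omega^{ZZ'AB} \,\|\, I^Z \otimes \omega^{Z'AB}) \geq \log_2(1/c) + D(\sigma^{XB} \,\|\, I^X \otimes \sigma^B), \tag{11.243}$$

where we observe that $\sigma^B = \omega^B$. So we aim to prove the above inequality. Consider the
following chain of inequalities:

$$D(\omega^{ZZ'AB} \,\|\, I^Z \otimes \omega^{Z'AB}) \tag{11.244}$$

$$\geq D(\omega^{ZZ'AB} \,\|\, U_\tau U_\tau^\dagger (I^Z \otimes \omega^{Z'AB}) U_\tau U_\tau^\dagger) \tag{11.245}$$

$$= D(U_\tau^\dagger \omega^{ZZ'AB} U_\tau \,\|\, U_\tau^\dagger (I^Z \otimes \omega^{Z'AB}) U_\tau) \tag{11.246}$$

$$= D(\rho^{AB} \,\|\, U_\tau^\dagger (I^Z \otimes \omega^{Z'AB}) U_\tau) \tag{11.247}$$

$$= D(U_\sigma \rho^{AB} U_\sigma^\dagger \,\|\, U_\sigma U_\tau^\dagger (I^Z \otimes \omega^{Z'AB}) U_\tau U_\sigma^\dagger). \tag{11.248}$$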

The first inequality follows from monotonicity of quantum relative entropy under the map
$\rho \to \Pi\rho\Pi + (I - \Pi)\rho(I - \Pi)$, where the projector $\Pi \equiv U_\tau U_\tau^\dagger$. The first equality follows
from invariance of quantum relative entropy under isometries. The second equality follows
from the fact that $U_\tau^\dagger \omega^{ZZ'AB} U_\tau = \rho^{AB}$. The third equality again follows from invariance of
quantum relative entropy under isometries. Let us define $\sigma^{XX'ABE}$ as

$$|\sigma\rangle^{XX'ABE} \equiv U_\sigma^{A \to XX'A} |\phi_\rho\rangle^{ABE}. \tag{11.249}$$

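We then have that the last line above is equal to

$$D(\sigma^{XX'AB} \,\|\, U_\sigma U_\tau^\dagger (I^Z \otimes \omega^{Z'AB}) U_\tau U_\sigma^\dagger), \tag{11.250}$$

and explicitly evaluating $U_\sigma U_\tau^\dagger (I^Z \otimes \omega^{Z'AB}) U_\tau U_\sigma^\dagger$ as

$$U_\sigma U_\tau^\dagger (I^Z \otimes \omega^{Z'AB}) U_\tau U_\sigma^\dagger \tag{11.251}$$

$$= U_\sigma \sum_z \left(\langle z|^{Z'} \otimes \sqrt{\Gamma_z^A}\right) \omega^{Z'AB} \left(|z\rangle^{Z'} \otimes \sqrt{\Gamma_z^A}\right) U_\sigma^\dagger \tag{11.252}$$

$$= \sum_z U_\sigma \left(\langle z|^{Z'} \otimes \sqrt{\Gamma_z^A}\right) \omega^{Z'AB} \left(|z\rangle^{Z'} \otimes \sqrt{\Gamma_z^A}\right) U_\sigma^\dagger, \tag{11.253}$$

gives that this is equal to

$$D\Big(\sigma^{XX'AB} \,\Big\|\, \sum_z U_\sigma \left(\langle z|^{Z'} \otimes \sqrt{\Gamma_z}\right) \omega^{Z'AB} \left(|z\rangle^{Z'} \otimes \sqrt{\Gamma_z}\right) U_\sigma^\dagger\Big). \tag{11.254}$$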



We trace out the $X'A$ systems and exploit monotonicity of quantum relative entropy and
cyclicity of trace to show that the above is not less than

$$D\Big(\sigma^{XB} \,\Big\|\, \sum_{z,x} |x\rangle\langle x|^X \otimes \mathrm{Tr}_{Z'A}\big\{\big(|z\rangle\langle z|^{Z'} \otimes \sqrt{\Gamma_z}\,\Lambda_x\sqrt{\Gamma_z}\big)\,\omega^{Z'AB}\big\}\Big). \tag{11.255}$$

Using the fact that $\sqrt{\Gamma_z}\,\Lambda_x\sqrt{\Gamma_z} = |\sqrt{\Lambda_x}\sqrt{\Gamma_z}|^2 \leq c\,I$ and that $-\log$ is operator monotone, we have
that the above is not less than

$$D(\sigma^{XB} \,\|\, c\, I^X \otimes \omega^B) = \log_2(1/c) + D(\sigma^{XB} \,\|\, I^X \otimes \omega^B) \tag{11.256}$$

$$= \log_2(1/c) + D(\sigma^{XB} \,\|\, I^X \otimes \sigma^B), \tag{11.257}$$

which finally proves the inequality in (11.236). We now leave it as an exercise to prove the
statement of the theorem starting from the inequality in (11.236). □



Exercise 11.9.8 Prove that (11.236) implies Theorem 11.9.7. 



Exercise 11.9.9 Prove that Theorem 11.9.7 implies the following entropic uncertainty
relation for a state $\rho^A$ on a single system:

$$H(X) + H(Z) \geq \log_2(1/c) + H(A)_\rho, \tag{11.258}$$

where $H(X)$ and $H(Z)$ are the Shannon entropies of the measurement outcomes.
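The Bell-state example discussed after Theorem 11.9.7 can be checked numerically. The sketch below is a minimal illustration under several assumptions that are not part of the text: both systems are qubits, the shared state is pure (so that $H(AB) = 0$ and $H(A|B) = -H(B)$), Alice's measurements are projective and given by orthonormal bases, and eigenvalues of $2 \times 2$ Hermitian matrices are computed in closed form. All function names are hypothetical, and the value $c = 1/2$ is the one quoted in the text for Pauli $X$ and $Z$.

```python
import math

def shannon(values):
    """Entropy in bits of a list of eigenvalues or probabilities (zeros skipped)."""
    return -sum(v * math.log2(v) for v in values if v > 1e-12)

def eig2(m):
    """Eigenvalues of a 2x2 Hermitian matrix given as nested lists."""
    a, b, d = complex(m[0][0]).real, complex(m[0][1]), complex(m[1][1]).real
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + abs(b) ** 2)
    return [mean + radius, mean - radius]

def bob_state(psi, bra):
    """Unnormalized state of B when Alice's outcome corresponds to basis vector `bra`."""
    v = [sum(complex(bra[a]).conjugate() * psi[a][b] for a in range(2)) for b in range(2)]
    return [[v[i] * v[j].conjugate() for j in range(2)] for i in range(2)]

def reduced_b(psi):
    """rho^B = Tr_A |psi><psi| for amplitudes psi[a][b]."""
    return [[sum(psi[a][i] * complex(psi[a][j]).conjugate() for a in range(2))
             for j in range(2)] for i in range(2)]

def outcome_entropy_given_b(psi, basis):
    """H(X|B) = H(XB) - H(B) for the classical-quantum state after measuring A."""
    blocks = [bob_state(psi, bra) for bra in basis]
    # The classical-quantum state is block diagonal, one block per outcome.
    h_xb = shannon([lam for blk in blocks for lam in eig2(blk)])
    return h_xb - shannon(eig2(reduced_b(psi)))

s = 1 / math.sqrt(2)
bell = [[s, 0], [0, s]]          # amplitudes psi[a][b] of the Bell state
z_basis = [[1, 0], [0, 1]]
x_basis = [[s, s], [s, -s]]

uncertainty_sum = (outcome_entropy_given_b(bell, x_basis)
                   + outcome_entropy_given_b(bell, z_basis))
h_a_given_b = -shannon(eig2(reduced_b(bell)))   # pure state: H(AB) = 0
c = 0.5                                         # value quoted in the text
lower_bound = math.log2(1 / c) + h_a_given_b
print(uncertainty_sum, lower_bound)
```

Both printed numbers are zero up to floating-point rounding, so the bound in (11.235) is saturated for this example.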



11.10 History and Further Reading 



The von Neumann entropy and its derivatives, such as the quantum conditional entropy and
quantum mutual information, are useful information measures and suffice for our studies in
this book. However, the von Neumann entropy is certainly not the only information measure
worthy of study. In recent years, entropic measures such as the min- and max-entropies (and
their smoothed variants) have emerged, and they are useful in developing a more general
theory of quantum information that applies beyond the IID setting that we study in this
book. In fact, one could view this theory as more fundamental than the theory presented
in this book, since the "one-shot" results often imply the IID results studied in this book.
Rather than developing this theory in full, we point to several excellent references on the
subject [2071 Q2Z1 EH3 [601 EHE1 EH- 

Fannes proved the inequality that bears his name [91], and Audenaert later gave a
significant improvement of the inequality [12]. Alicki and Fannes proved the inequality in
Theorem 11.9.4 in Ref. [9]; the proof given here is the same as their proof. The coherent
information first appeared in Ref. [218], where Schumacher and Nielsen proved that it obeys
a quantum data processing inequality (this was the first clue that the coherent information
would be an important information quantity for the transmission of quantum data through a
noisy quantum channel). Schumacher and Westmoreland proved the bound regarding
quantum relative entropy and trace distance in Theorem 11.9.2 [221]. Lieb and Ruskai proved
the strong subadditivity of quantum entropy [184].

Entropic uncertainty relations have a long and interesting history. We do not review this
history here but instead point to the survey article [245]. Since this survey appeared, there
has been much interest in entropic uncertainty relations, with the most notable advance
being the entropic uncertainty relation in the presence of quantum memory [38]. The proof
that we give for Theorem 11.9.7 is the same as that in Ref. [56], which in turn exploits ideas
from Ref. [239].
CHAPTER 12

The Information of Quantum Channels



We introduced several classical and quantum entropic quantities in Chapters 10 and 11:
entropy, conditional entropy, joint entropy, mutual information, relative entropy, and
conditional mutual information. Each of these entropic quantities is static, in the sense that each
is with respect to random variables or quantum systems that certain parties possess.

In this chapter, we introduce several dynamic entropic quantities for channels, whether 
they be classical or quantum. We derive these measures by exploiting the static measures 
from the two previous chapters. We send part of a system through a channel, compute a 
static measure with respect to the input-output state, and maximize the static measure over 
all possible systems that we can transmit through the channel. This process then gives rise 
to a dynamic measure that quantifies the ability of a channel to preserve correlations. For 
example, we could send half of a pure entangled state $|\phi\rangle^{AA'}$ through a quantum channel
$\mathcal{N}^{A' \to B}$; this transmission gives rise to some noisy state $\mathcal{N}^{A' \to B}(\phi^{AA'})$. We would then take
the mutual information of the resulting state and maximize the mutual information over all
such pure input states:

$$\max_{\phi^{AA'}} I(A;B)_{\mathcal{N}^{A' \to B}(\phi^{AA'})}. \tag{12.1}$$

The above quantity is a dynamic information measure of the channel's ability to preserve
correlations; Section 12.4 introduces this quantity as the mutual information of the channel
$\mathcal{N}$.

For now, we simply think of the quantities in this chapter as measures of a channel's
ability to preserve correlations. Later, we show that these quantities have explicit operational
interpretations in terms of a channel's ability to perform a certain task, such as the
transmission of classical or quantum information.^1 Such an operational interpretation gives meaning
to an entropic measure; otherwise, it is difficult to understand a measure in an
information-theoretic sense without having a specific operational task to which it corresponds.

1 Giving operational interpretations to informational measures is in fact the main goal of this book!
Recall that the entropy obeys an additivity property for any two independent random
variables $X_1$ and $X_2$:

$$H(X_1, X_2) = H(X_1) + H(X_2). \tag{12.2}$$

The above additivity property extends to a large sequence $X_1, \ldots, X_n$ of independent and
identically distributed random variables. That is, applying (12.2) inductively shows that a
simple formula $nH(X)$ is the entropy of the sequence:

$$H(X_1, \ldots, X_n) = \sum_{i=1}^{n} H(X_i) = \sum_{i=1}^{n} H(X) = nH(X), \tag{12.3}$$

where random variable $X$ has the same distribution as all of $X_1, \ldots, X_n$. Similarly, quantum
entropy is additive for any two quantum systems in a product state $\rho \otimes \sigma$:

$$H(\rho \otimes \sigma) = H(\rho) + H(\sigma), \tag{12.4}$$

and applying (12.4) inductively to a sequence of quantum states gives the following similar
simple formula: $H(\rho^{\otimes n}) = nH(\rho)$. Additivity is a desirable property and a natural
expectation that we have for any measure of information on independent systems.
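The classical statements (12.2) and (12.3) are easy to confirm numerically. The following Python sketch is only an illustration: the distributions `p1` and `p2`, the block length `n`, and the helper names are arbitrary choices made here and do not come from the text. It covers the classical case only.

```python
import math
from itertools import product

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def product_distribution(*marginals):
    """Joint distribution of independent random variables, flattened to a list."""
    return [math.prod(combo) for combo in product(*marginals)]

p1 = [0.5, 0.25, 0.25]
p2 = [0.9, 0.1]

# (12.2): the joint entropy of two independent variables is the sum of entropies.
gap_two = entropy(product_distribution(p1, p2)) - (entropy(p1) + entropy(p2))

# (12.3): n i.i.d. copies have entropy n H(X).
n = 4
gap_iid = entropy(product_distribution(*[p2] * n)) - n * entropy(p2)
print(gap_two, gap_iid)
```

Both gaps vanish up to floating-point rounding.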

In analogy with the static measures, we would like additivity to hold for the dynamic 
information measures. Without additivity holding, we cannot really make sense of a given 
measure because we would have to evaluate the measure on a potentially infinite number 
of independent channel uses. This evaluation on so many channel uses is an impossible 
optimization problem. Additionally, the requirement to maximize over so many uses of the 
channel does not identify a given measure as a unique measure of a channel's ability to 
perform a certain task. As we see later, there could be other measures that are equal to the 
original one when we take the limit of many channel uses. Thus, a measure does not have 
much substantive meaning if additivity does not hold. 

We devote this chapter to the discussion of several dynamic measures. Additivity holds in 
the general case for only two of the dynamic measures presented here: the mutual information 
of a classical channel and the mutual information of a quantum channel. For all other 
measures, there are known counterexamples of channels for which additivity does not hold. 
In this chapter, we do not discuss the counterexamples, but instead focus only on classes of 
channels for which additivity does hold, in an effort to understand it in a technical sense. 
The proof techniques for additivity exploit many of the ideas introduced in the two previous 
chapters and give us a chance to practice with what we have learned there on one of the 
most important problems in quantum Shannon theory. 

12.1 Mutual Information of a Classical Channel 

Suppose that we would like to determine how much information we can transmit through a
classical channel $\mathcal{N}$. Recall our simplified model of a classical channel $\mathcal{N}$ from Chapter 2,
in which some conditional probability density $p_{Y|X}(y|x)$ models the effects of noise. That is,
we obtain some random variable $Y$ if we input a random variable $X$ to the channel.




What is a good measure of the information throughput of this channel? The mutual 
information is perhaps the best starting point. Suppose that random variables X and Y 
are Bernoulli. If the classical channel is noiseless and X is completely random, the input 
and output random variables X and Y are perfectly correlated and the mutual information 
$I(X;Y)$ is equal to one bit, implying that the sender can transmit one bit per transmission
as we would expect. If the classical channel is completely noisy (in the sense that it prepares 
an output that is constant irrespective of the input), the input and output random variables 
are independent and the mutual information is equal to zero bits. This result matches 
our intuition that the sender should not be able to transmit any information through this 
completely noisy channel. 
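The two extreme cases above can be reproduced with a few lines of Python. This is a minimal sketch under the assumption that a channel is represented as a row-stochastic matrix `channel[x][y]` $= p_{Y|X}(y|x)$; the helper names and the two example matrices are illustrative and are not taken from the text.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(p_x, channel):
    """I(X;Y) = H(Y) - H(Y|X) in bits, where channel[x][y] = p(y|x)."""
    n_out = len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x))) for y in range(n_out)]
    h_y_given_x = sum(p_x[x] * entropy(channel[x]) for x in range(len(p_x)))
    return entropy(p_y) - h_y_given_x

uniform = [0.5, 0.5]                    # completely random Bernoulli input
noiseless = [[1.0, 0.0], [0.0, 1.0]]    # Y = X
constant = [[1.0, 0.0], [1.0, 0.0]]     # output is constant irrespective of input

i_noiseless = mutual_information(uniform, noiseless)
i_constant = mutual_information(uniform, constant)
print(i_noiseless, i_constant)
```

The noiseless channel yields one bit and the constant-output channel yields zero bits, in agreement with the discussion above.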

In the above model for a classical channel, the conditional probability density $p_{Y|X}(y|x)$
remains fixed, but we can "play around" with the input random variable $X$ by modifying its
probability density $p_X(x)$.^2 Thus, we still "have room" for optimizing the mutual information
of the channel $\mathcal{N}$ by modifying this input density. This gives us the following definition:

Definition 12.1.1 (Mutual Information of a Classical Channel). The mutual information
$I(\mathcal{N})$ of the classical channel $\mathcal{N}$ is as follows:

$$I(\mathcal{N}) \equiv \max_{p_X(x)} I(X;Y). \tag{12.5}$$
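To make the maximization in (12.5) concrete, the following sketch performs a simple grid search over binary input distributions for a binary symmetric channel. The grid search, the flip probability `eps`, and the helper names are assumptions chosen for illustration. The closed form $1 - h_2(\varepsilon)$ used for comparison is the standard expression for this particular channel and is not derived in the text above.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(p_x, channel):
    """I(X;Y) = H(Y) - H(Y|X) in bits, where channel[x][y] = p(y|x)."""
    n_out = len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x))) for y in range(n_out)]
    h_y_given_x = sum(p_x[x] * entropy(channel[x]) for x in range(len(p_x)))
    return entropy(p_y) - h_y_given_x

def channel_mutual_information(channel, steps=1000):
    """Grid search for max over p_X of I(X;Y), binary input only (illustrative)."""
    best_k = max(range(steps + 1),
                 key=lambda k: mutual_information([k / steps, 1 - k / steps], channel))
    p = best_k / steps
    return mutual_information([p, 1 - p], channel), p

eps = 0.1
bsc = [[1 - eps, eps], [eps, 1 - eps]]   # binary symmetric channel
value, p_opt = channel_mutual_information(bsc)
closed_form = 1 - entropy([eps, 1 - eps])
print(value, p_opt, closed_form)
```

For this symmetric channel the search settles on the uniform input, and the numerical maximum agrees with the closed form.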

12.1.1 Regularization of the Mutual Information of a Classical 
Channel 

We now consider whether exploiting multiple uses of a classical channel $\mathcal{N}$ and allowing for
correlations between its inputs can increase its mutual information. That is, suppose that
we have two independent uses of a classical channel $\mathcal{N}$ available. Let $X_1$ and $X_2$ denote
the input random variables to the respective first and second copies of the channel, and
let $Y_1$ and $Y_2$ denote the output random variables. Each of the two uses of the channel is
equivalent to the mapping $p_{Y|X}(y|x)$ so that the channel uses are independent and identically
distributed. Let $\mathcal{N} \otimes \mathcal{N}$ denote the tandem channel that corresponds to the mapping

$$p_{Y_1,Y_2|X_1,X_2}(y_1,y_2|x_1,x_2) = p_{Y_1|X_1}(y_1|x_1)\, p_{Y_2|X_2}(y_2|x_2), \tag{12.6}$$

where both $p_{Y_1|X_1}(y_1|x_1)$ and $p_{Y_2|X_2}(y_2|x_2)$ are equivalent to the mapping $p_{Y|X}(y|x)$. The
mutual information of a classical tandem channel is as follows:

$$I(\mathcal{N} \otimes \mathcal{N}) \equiv \max_{p_{X_1,X_2}(x_1,x_2)} I(X_1,X_2;Y_1,Y_2). \tag{12.7}$$

We might think that we could increase the mutual information of this classical channel
by allowing for correlations between the inputs to the channels through a correlated
distribution $p_{X_1,X_2}(x_1,x_2)$. That is, there could be some superadditive effect if the mutual
information of the classical tandem channel $\mathcal{N} \otimes \mathcal{N}$ is strictly greater than two individual
mutual informations:

$$I(\mathcal{N} \otimes \mathcal{N}) > 2I(\mathcal{N}). \tag{12.8}$$

Figure 12.1 displays the scenario corresponding to the above question.

Figure 12.1: The above figure displays the scenario for determining whether the mutual information of
two classical channels $\mathcal{N}_1$ and $\mathcal{N}_2$ is additive. The question of additivity is equivalent to the possibility of
classical correlations being able to enhance the mutual information of two classical channels. The result
proved in Theorem 12.1.1 is that the mutual information is additive for any two classical channels, so that
classical correlations cannot enhance it.

2 Recall the idea from Section 2.2.4 where Alice and Bob actually choose a code for the channel randomly
according to the density $p_X(x)$.

In fact, we can take the above argument to its extreme, by defining the regularized mutual
information $I_{\mathrm{reg}}(\mathcal{N})$ of a classical channel as follows:

$$I_{\mathrm{reg}}(\mathcal{N}) \equiv \lim_{n \to \infty} \frac{1}{n} I(\mathcal{N}^{\otimes n}). \tag{12.9}$$

In the above definition, the quantity $I(\mathcal{N}^{\otimes n})$ is as follows:

$$I(\mathcal{N}^{\otimes n}) \equiv \max_{p_{X^n}(x^n)} I(X^n;Y^n), \tag{12.10}$$

$\mathcal{N}^{\otimes n}$ denotes $n$ channels in tandem with mapping

$$p_{Y^n|X^n}(y^n|x^n) = \prod_{i=1}^{n} p_{Y_i|X_i}(y_i|x_i), \tag{12.11}$$

where $X^n \equiv X_1, X_2, \ldots, X_n$, $x^n \equiv x_1, x_2, \ldots, x_n$, and $Y^n \equiv Y_1, Y_2, \ldots, Y_n$. The potential
superadditive effect would have the following form after bootstrapping the inequality in
(12.8) to the regularization:

$$I_{\mathrm{reg}}(\mathcal{N}) > I(\mathcal{N}). \tag{12.12}$$

Exercise 12.1.1 Determine the maximum value of $I_{\mathrm{reg}}(\mathcal{N})$ when taking the limit. Thus,
this quantity is finite.
The next section shows that the above strict inequalities do not hold for a classical
channel, implying that no such superadditive effect occurs for its mutual information. In
fact, the mutual information of a classical channel obeys an additivity property that is the
cornerstone of our understanding of classical information theory. This additivity property
implies that

$$I(\mathcal{N} \otimes \mathcal{N}) = 2I(\mathcal{N}) \tag{12.13}$$

and

$$I_{\mathrm{reg}}(\mathcal{N}) = I(\mathcal{N}), \tag{12.14}$$

by an inductive argument. Thus, classical correlations between inputs do not increase the
mutual information of a classical channel.

We are stressing the importance of additivity in classical information theory because
recent research has demonstrated that superadditive effects can occur in quantum Shannon
theory (see Section 19.5, for example). These quantum results imply that our understanding
of quantum Shannon theory is not yet complete, but they also demonstrate the fascinating
possibility that quantum correlations can increase the information throughput of a quantum
channel.

12.1.2 Additivity 

The mutual information of classical channels satisfies the important and natural property of
additivity. We prove the strongest form of additivity that occurs for the mutual information
of two different classical channels. Let $\mathcal{N}_1$ and $\mathcal{N}_2$ denote two different classical channels
corresponding to the respective mappings $p_{Y_1|X_1}(y_1|x_1)$ and $p_{Y_2|X_2}(y_2|x_2)$, and let $\mathcal{N}_1 \otimes \mathcal{N}_2$
denote the tandem channel that corresponds to the mapping

$$p_{Y_1,Y_2|X_1,X_2}(y_1,y_2|x_1,x_2) = p_{Y_1|X_1}(y_1|x_1)\, p_{Y_2|X_2}(y_2|x_2). \tag{12.15}$$

The mutual information of the tandem channel is then as follows:

$$I(\mathcal{N}_1 \otimes \mathcal{N}_2) \equiv \max_{p_{X_1,X_2}(x_1,x_2)} I(X_1,X_2;Y_1,Y_2). \tag{12.16}$$

The following theorem states the additivity property.

Theorem 12.1.1 (Additivity of Mutual Information of Classical Channels). The mutual
information of the classical tandem channel $\mathcal{N}_1 \otimes \mathcal{N}_2$ is the sum of their individual mutual
informations:

$$I(\mathcal{N}_1 \otimes \mathcal{N}_2) = I(\mathcal{N}_1) + I(\mathcal{N}_2). \tag{12.17}$$

Proof. We first prove the inequality $I(\mathcal{N}_1 \otimes \mathcal{N}_2) \geq I(\mathcal{N}_1) + I(\mathcal{N}_2)$. This inequality is
easier to prove than the other direction. Let $p^*_{X_1}(x_1)$ and $p^*_{X_2}(x_2)$ denote the distributions
that achieve the respective maximums of $I(\mathcal{N}_1)$ and $I(\mathcal{N}_2)$. The joint probability distribution
for all input and output random variables is then as follows:

$$p_{X_1,X_2,Y_1,Y_2}(x_1,x_2,y_1,y_2) = p^*_{X_1}(x_1)\, p^*_{X_2}(x_2)\, p_{Y_1|X_1}(y_1|x_1)\, p_{Y_2|X_2}(y_2|x_2). \tag{12.18}$$
Observe that $X_1$ and $Y_1$ are independent of $X_2$ and $Y_2$. Then the following chain of
inequalities holds:

$$I(\mathcal{N}_1) + I(\mathcal{N}_2) = I(X_1;Y_1) + I(X_2;Y_2) \tag{12.19}$$

$$= H(Y_1) - H(Y_1|X_1) + H(Y_2) - H(Y_2|X_2) \tag{12.20}$$

$$= H(Y_1,Y_2) - H(Y_1|X_1,X_2) - H(Y_2|X_2,X_1,Y_1) \tag{12.21}$$

$$= H(Y_1,Y_2) - H(Y_1,Y_2|X_1,X_2) \tag{12.22}$$

$$= I(X_1,X_2;Y_1,Y_2) \tag{12.23}$$

$$\leq I(\mathcal{N}_1 \otimes \mathcal{N}_2). \tag{12.24}$$

The first equality follows by evaluating the mutual informations $I(\mathcal{N}_1)$ and $I(\mathcal{N}_2)$ with
respect to the maximizing distributions $p^*_{X_1}(x_1)$ and $p^*_{X_2}(x_2)$. The second equality follows
by expanding the mutual informations. The third equality follows because $H(Y_1) + H(Y_2) =
H(Y_1,Y_2)$ when random variables $Y_1$ and $Y_2$ are independent, $H(Y_1|X_1,X_2) = H(Y_1|X_1)$
when $Y_1$ is independent of $X_2$, and $H(Y_2|X_2,X_1,Y_1) = H(Y_2|X_2)$ when $Y_2$ is independent
of $X_1$ and $Y_1$. The fourth equality follows from the entropy chain rule in Exercise 10.3.2.
The next equality follows again by expanding the mutual information. The final inequality
follows because the input distribution $p^*_{X_1}(x_1)\, p^*_{X_2}(x_2)$ is a particular input distribution of
the more general form $p_{X_1,X_2}(x_1,x_2)$ needed in the maximization of the mutual information
of the tandem channel $\mathcal{N}_1 \otimes \mathcal{N}_2$.

We now prove the non-trivial inequality $I(\mathcal{N}_1 \otimes \mathcal{N}_2) \leq I(\mathcal{N}_1) + I(\mathcal{N}_2)$. Let
$p^*_{X_1,X_2}(x_1,x_2)$ denote the distribution that maximizes $I(\mathcal{N}_1 \otimes \mathcal{N}_2)$, and let

$$q_{X_1|X_2}(x_1|x_2) \quad \text{and} \quad q_{X_2}(x_2) \tag{12.25}$$

be distributions such that

$$p^*_{X_1,X_2}(x_1,x_2) = q_{X_1|X_2}(x_1|x_2)\, q_{X_2}(x_2). \tag{12.26}$$

Recall that the mapping for the tandem channel $\mathcal{N}_1 \otimes \mathcal{N}_2$ is as follows:

$$p_{Y_1,Y_2|X_1,X_2}(y_1,y_2|x_1,x_2) = p_{Y_1|X_1}(y_1|x_1)\, p_{Y_2|X_2}(y_2|x_2). \tag{12.27}$$

By summing over $y_2$, we observe that $Y_1$ is conditionally independent of $X_2$ when conditioning
on $X_1$ because

$$p_{Y_1|X_1,X_2}(y_1|x_1,x_2) = p_{Y_1|X_1}(y_1|x_1). \tag{12.28}$$

Also, the joint distribution $p_{X_1,Y_1,Y_2|X_2}(x_1,y_1,y_2|x_2)$ has the form

$$p_{X_1,Y_1,Y_2|X_2}(x_1,y_1,y_2|x_2) = p_{Y_1|X_1}(y_1|x_1)\, q_{X_1|X_2}(x_1|x_2)\, p_{Y_2|X_2}(y_2|x_2). \tag{12.29}$$

Then $Y_2$ is conditionally independent of $X_1$ and $Y_1$ when conditioning on $X_2$. Consider the
following chain of inequalities:

$$I(\mathcal{N}_1 \otimes \mathcal{N}_2) = I(X_1,X_2;Y_1,Y_2) \tag{12.30}$$

$$= H(Y_1,Y_2) - H(Y_1,Y_2|X_1,X_2) \tag{12.31}$$

$$= H(Y_1,Y_2) - H(Y_1|X_1,X_2) - H(Y_2|Y_1,X_1,X_2) \tag{12.32}$$

$$= H(Y_1,Y_2) - H(Y_1|X_1) - H(Y_2|X_2) \tag{12.33}$$

$$\leq H(Y_1) + H(Y_2) - H(Y_1|X_1) - H(Y_2|X_2) \tag{12.34}$$

$$= I(X_1;Y_1) + I(X_2;Y_2) \tag{12.35}$$

$$\leq I(\mathcal{N}_1) + I(\mathcal{N}_2). \tag{12.36}$$

The first equality follows from the definition of $I(\mathcal{N}_1 \otimes \mathcal{N}_2)$ in (12.16) and by evaluating
the mutual information with respect to the distributions $p^*_{X_1,X_2}(x_1,x_2)$, $p_{Y_1|X_1}(y_1|x_1)$, and
$p_{Y_2|X_2}(y_2|x_2)$. The second equality follows by expanding the mutual information $I(X_1,X_2;Y_1,Y_2)$.
The third equality follows from the entropy chain rule. The fourth equality follows because
$H(Y_1|X_1,X_2) = H(Y_1|X_1)$ when $Y_1$ is conditionally independent of $X_2$ given $X_1$, as pointed
out in (12.28). Also, the equality follows because $H(Y_2|Y_1,X_1,X_2) = H(Y_2|X_2)$ when $Y_2$ is
conditionally independent of $X_1$ and $Y_1$ given $X_2$, as pointed out in (12.29). The first inequality
follows from subadditivity of entropy (Exercise 10.3.3). The last equality follows from the
definition of mutual information, and the final inequality follows because the marginal
distributions for $X_1$ and $X_2$ can only achieve a mutual information no greater than that achieved
by the respective maximizing marginal distributions for $I(\mathcal{N}_1)$ and $I(\mathcal{N}_2)$. □
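Theorem 12.1.1 can be illustrated numerically. The sketch below is only a spot check under assumptions made here, not part of the proof: it uses a binary symmetric channel and a binary erasure channel (for both, the uniform input is known to be optimal), finds the single-channel maxima by grid search, and samples random correlated input distributions for the tandem channel. All names and parameter values are hypothetical.

```python
import math
import random

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(p_x, channel):
    """I(X;Y) = H(Y) - H(Y|X) in bits, where channel[x][y] = p(y|x)."""
    n_out = len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x))) for y in range(n_out)]
    h_y_given_x = sum(p_x[x] * entropy(channel[x]) for x in range(len(p_x)))
    return entropy(p_y) - h_y_given_x

def tandem(ch1, ch2):
    """Product channel with inputs (x1, x2) and outputs (y1, y2), as in (12.15)."""
    return [[a * b for a in row1 for b in row2] for row1 in ch1 for row2 in ch2]

def best_binary_input(channel, steps=1000):
    """Grid-search maximum of I(X;Y) over binary input distributions."""
    return max(mutual_information([k / steps, 1 - k / steps], channel)
               for k in range(steps + 1))

bsc = [[0.9, 0.1], [0.1, 0.9]]                  # binary symmetric channel
bec = [[0.75, 0.25, 0.0], [0.0, 0.25, 0.75]]    # binary erasure channel
sum_of_singles = best_binary_input(bsc) + best_binary_input(bec)

both = tandem(bsc, bec)
at_product = mutual_information([0.25] * 4, both)   # product of uniform inputs

random.seed(0)
best_correlated = 0.0
for _ in range(2000):
    weights = [random.random() for _ in range(4)]
    total = sum(weights)
    p_joint = [w / total for w in weights]           # generally correlated input
    best_correlated = max(best_correlated, mutual_information(p_joint, both))

print(sum_of_singles, at_product, best_correlated)
```

The product of the individually optimal inputs attains $I(\mathcal{N}_1) + I(\mathcal{N}_2)$, and none of the sampled correlated inputs exceeds it.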



A simple corollary of Theorem 12.1.1 is that correlations between input random variables
cannot increase the mutual information of a classical channel. The proof follows by
a straightforward induction argument. Thus, the single-letter expression in (12.5) for the
mutual information of a classical channel suffices for understanding the ability of a classical
channel to maintain correlations between its input and output.

Corollary 12.1.1. The regularized mutual information of a classical channel is equal to its
mutual information:

$$I_{\mathrm{reg}}(\mathcal{N}) = I(\mathcal{N}). \tag{12.37}$$

Proof. We prove the result using induction on $n$, by showing that $I(\mathcal{N}^{\otimes n}) = nI(\mathcal{N})$ for all $n$,
implying that the limit in (12.9) is not necessary. The base case for $n = 1$ is trivial. Suppose
the result holds for $n$: $I(\mathcal{N}^{\otimes n}) = nI(\mathcal{N})$. The following chain of equalities then proves the
inductive step:

$$I(\mathcal{N}^{\otimes (n+1)}) = I(\mathcal{N} \otimes \mathcal{N}^{\otimes n}) \tag{12.38}$$

$$= I(\mathcal{N}) + I(\mathcal{N}^{\otimes n}) \tag{12.39}$$

$$= I(\mathcal{N}) + nI(\mathcal{N}). \tag{12.40}$$

The first equality follows because the channel $\mathcal{N}^{\otimes (n+1)}$ is equivalent to a tandem of $\mathcal{N}$ and
$\mathcal{N}^{\otimes n}$. The second critical equality follows from the application of Theorem 12.1.1 because
the distributions of $\mathcal{N}$ and $\mathcal{N}^{\otimes n}$ factorize as in (12.27). The final equality follows from the
induction hypothesis. □
12.1.3 Optimizing the Mutual Information of a Classical Channel

The definition in (12.5) seems like a suitable definition for the mutual information of a
classical channel, but how difficult is the maximization problem that it sets out? Theorem 12.1.2
below states an important property of the mutual information $I(X;Y)$ that allows us to
answer this question. Suppose that we fix the conditional density $p_{Y|X}(y|x)$, but can vary
the input density $p_X(x)$; this scenario is the same as the above one. Theorem 12.1.2 below
proves that the mutual information $I(X;Y)$ is a concave function of the density $p_X(x)$. In
particular, this result implies that every local maximum of $I(X;Y)$ over input densities is a
global maximum, and the optimization problem is therefore a straightforward computation that
can exploit convex optimization methods.

Theorem 12.1.2. Suppose that we fix the conditional probability density $p_{Y|X}(y|x)$. Then
the mutual information $I(X;Y)$ is concave in the marginal density $p_X(x)$:

$$\lambda I(X_1;Y) + (1 - \lambda) I(X_2;Y) \leq I(Z;Y), \tag{12.41}$$

where $\lambda \in [0,1]$, random variable $X_1$ has density $p_{X_1}(x)$, $X_2$ has density $p_{X_2}(x)$, and $Z$ has density
$\lambda p_{X_1}(x) + (1 - \lambda) p_{X_2}(x)$.

Proof. Let us fix the density $p_{Y|X}(y|x)$. The density $p_Y(y)$ is a linear function of $p_X(x)$
because $p_Y(y) = \sum_x p_{Y|X}(y|x)\, p_X(x)$. Since the entropy $H(Y)$ is concave in $p_Y(y)$, it follows
that $H(Y)$ is concave in $p_X(x)$. Recall that the conditional entropy
$H(Y|X) = \sum_x p_X(x) H(Y|X = x)$. The entropy $H(Y|X = x)$ is fixed when
the conditional probability density $p_{Y|X}(y|x)$ is fixed. Thus, $H(Y|X)$ is a linear function
of $p_X(x)$. These two results imply that the mutual information $I(X;Y) = H(Y) - H(Y|X)$ is
concave in the marginal density $p_X(x)$ when the conditional density $p_{Y|X}(y|x)$ is fixed. □
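The concavity inequality (12.41) can be spot-checked on random inputs. The sketch below is an illustration only: the three-input channel matrix, the random sampling, and the helper names are all assumptions introduced here. It tracks the smallest observed value of $I(Z;Y) - [\lambda I(X_1;Y) + (1-\lambda) I(X_2;Y)]$.

```python
import math
import random

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(p_x, channel):
    """I(X;Y) = H(Y) - H(Y|X) in bits, where channel[x][y] = p(y|x)."""
    n_out = len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x))) for y in range(n_out)]
    h_y_given_x = sum(p_x[x] * entropy(channel[x]) for x in range(len(p_x)))
    return entropy(p_y) - h_y_given_x

def random_distribution(n):
    """A random probability vector of length n."""
    weights = [random.random() for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def mix(p, q, lam):
    """Density of Z: lam * p + (1 - lam) * q."""
    return [lam * a + (1 - lam) * b for a, b in zip(p, q)]

random.seed(1)
channel = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.25, 0.25, 0.5]]   # fixed p(y|x)

worst_gap = float("inf")
for _ in range(1000):
    p1, p2, lam = random_distribution(3), random_distribution(3), random.random()
    lhs = lam * mutual_information(p1, channel) + (1 - lam) * mutual_information(p2, channel)
    rhs = mutual_information(mix(p1, p2, lam), channel)
    worst_gap = min(worst_gap, rhs - lhs)
print(worst_gap)
```

The smallest gap stays non-negative (up to rounding), as the theorem requires.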

12.2 Private Information of a Wiretap Channel 

Suppose now that we extend the above two-user classical communication scenario to a
three-user communication scenario, where the parties are Alice, Bob, and Eve. Alice would like
to communicate to Bob while keeping her messages private from Eve. The channel $\mathcal{N}$ in
this setting is the wiretap channel, corresponding to the following conditional probability
density:

$$p_{Y,Z|X}(y,z|x). \tag{12.42}$$

Alice has access to the input random variable $X$, Bob receives output random variable $Y$,
and Eve receives the random variable $Z$. Figure 12.2 depicts this setting.

We would like to establish a measure of information throughput for this scenario. It
might seem intuitive that it should be the amount of correlations that Alice can establish
with Bob, less the correlations that Eve receives:

$$I(X;Y) - I(X;Z). \tag{12.43}$$

But Alice can maximize over all possible coding strategies on her end (all possible probability
distributions $p_X(x)$). This leads us to the following definition:
Figure 12.2: The setting for the classical wiretap channel.

Definition 12.2.1 (Private Information of a Wiretap Channel). The private information
$P(\mathcal{N})$ of a classical wiretap channel is as follows:

$$P(\mathcal{N}) \equiv \max_{p_X(x)} \left[ I(X;Y) - I(X;Z) \right]. \tag{12.44}$$

We should note that the above definition of the private information is not the most general
formula: we could include a preprocessing step with the Markov chain $U \to X \to (Y,Z)$,
but we stick with the above definition for simplicity.

It is possible to provide an operational interpretation of the private information, showing
that it is indeed the private capacity of the wiretap channel, but we do not do that here.
We instead focus on the additivity properties of the private information $P(\mathcal{N})$.

One may wonder if the above quantity is non-negative, given that it is the difference of two
mutual informations. Non-negativity does hold, and a simple proof demonstrates this fact.

Property 12.2.1 The private information $P(\mathcal{N})$ of a wiretap channel is non-negative:

$$P(\mathcal{N}) \geq 0. \tag{12.45}$$

Proof. We can choose the density $p_X(x)$ in the maximization of $P(\mathcal{N})$ to be the degenerate
distribution $p_X(x) = \delta_{x,x_0}$ for some realization $x_0$. Then both mutual informations $I(X;Y)$
and $I(X;Z)$ vanish, and their difference vanishes as well. The private information $P(\mathcal{N})$ can
only then be greater than or equal to zero because the above choice $p_X(x)$ is a particular
choice of the density $p_X(x)$ and $P(\mathcal{N})$ requires a maximization over all such distributions. □
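The definition in (12.44) and the argument for Property 12.2.1 can be illustrated with a small computation. The sketch below rests on assumptions made here for illustration: the input is binary, the maximization is a grid search, and the wiretap channel is described only through its marginals $p_{Y|X}$ and $p_{Z|X}$ (which is all that $I(X;Y) - I(X;Z)$ depends on). The two example channels and all names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(p_x, channel):
    """I(X;Y) = H(Y) - H(Y|X) in bits, where channel[x][y] = p(y|x)."""
    n_out = len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x))) for y in range(n_out)]
    h_y_given_x = sum(p_x[x] * entropy(channel[x]) for x in range(len(p_x)))
    return entropy(p_y) - h_y_given_x

def private_information(to_bob, to_eve, steps=1000):
    """Grid search for max over binary p_X of I(X;Y) - I(X;Z)."""
    def objective(k):
        p = [k / steps, 1 - k / steps]
        return mutual_information(p, to_bob) - mutual_information(p, to_eve)
    return max(objective(k) for k in range(steps + 1))

to_bob = [[0.9, 0.1], [0.1, 0.9]]        # marginal p(y|x): flip probability 0.1
to_eve = [[0.74, 0.26], [0.26, 0.74]]    # marginal p(z|x): flip probability 0.26

# Degenerate input p_X(x) = delta_{x, x0}: both mutual informations vanish.
degenerate = [1.0, 0.0]
at_degenerate = (mutual_information(degenerate, to_bob)
                 - mutual_information(degenerate, to_eve))

p_private = private_information(to_bob, to_eve)
print(at_degenerate, p_private)
```

Because the grid contains the degenerate inputs, the maximum can never fall below zero, which mirrors the proof above. For this example the maximum is strictly positive, since Eve's channel is noisier than Bob's.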

12.2.1 Additivity of Private Information for Degraded Wiretap 
Channels 

It is difficult to show that the private information of general wiretap channels is additive,
but it is straightforward to do so for a particular type of wiretap channel, called a physically
degradable wiretap channel. A wiretap channel is physically degradable if $X$, $Y$, and $Z$ form
the following Markov chain: $X \to Y \to Z$. That is, there is some channel $p_{Z|Y}(z|y)$ that
Bob can apply to his output to simulate the channel $p_{Z|X}(z|x)$ to Eve:

$$p_{Z|X}(z|x) = \sum_y p_{Z|Y}(z|y)\, p_{Y|X}(y|x). \tag{12.46}$$

This condition allows us to apply the data processing inequality to demonstrate that the
private information of degraded wiretap channels is additive. Figure 12.3 displays the
scenario corresponding to the analysis involved in determining whether the private information
is additive.

Figure 12.3: The above figure displays the scenario for determining whether the private information of
two classical channels $\mathcal{N}_1$ and $\mathcal{N}_2$ is additive. The question of additivity is equivalent to the possibility of
classical correlations being able to enhance the private information of two classical channels. The result
proved in Theorem 12.2.1 is that the private information is additive for two degraded wiretap channels, so
that classical correlations cannot enhance the private information in this case.



Theorem 12.2.1 (Additivity of Private Information of Degraded Wiretap Channels). The
private information of the classical tandem channel $\mathcal{N}_1 \otimes \mathcal{N}_2$ of two degraded wiretap
channels $\mathcal{N}_1$ and $\mathcal{N}_2$ is the sum of their individual private informations:

\[ P(\mathcal{N}_1 \otimes \mathcal{N}_2) = P(\mathcal{N}_1) + P(\mathcal{N}_2). \tag{12.47} \]

Proof. The inequality $P(\mathcal{N}_1 \otimes \mathcal{N}_2) \geq P(\mathcal{N}_1) + P(\mathcal{N}_2)$ is trivial and we leave it as an exercise
for the reader to complete. We thus prove the non-trivial inequality for the case of degraded
wiretap channels: $P(\mathcal{N}_1 \otimes \mathcal{N}_2) \leq P(\mathcal{N}_1) + P(\mathcal{N}_2)$. Let $p^*_{X_1 X_2}(x_1, x_2)$ be the distribution that
maximizes the quantity $P(\mathcal{N}_1 \otimes \mathcal{N}_2)$. The channels are of the following form:

\begin{align}
& p_{Y_1,Z_1|X_1}(y_1, z_1|x_1) \, p_{Y_2,Z_2|X_2}(y_2, z_2|x_2) \notag \\
&\quad = p_{Z_1|Y_1}(z_1|y_1) \, p_{Y_1|X_1}(y_1|x_1) \, p_{Z_2|Y_2}(z_2|y_2) \, p_{Y_2|X_2}(y_2|x_2). \tag{12.48}
\end{align}

Observe then that $Y_1$ and $Z_1$ are independent of $X_2$ when conditioned on $X_1$. Also, $Y_2$ and $Z_2$
are independent of $Y_1$, $Z_1$, and $X_1$ when conditioned on $X_2$. Then the following chain of
inequalities holds:

\begin{align}
P(\mathcal{N}_1 \otimes \mathcal{N}_2)
&= I(X_1 X_2; Y_1 Y_2) - I(X_1 X_2; Z_1 Z_2) \tag{12.49} \\
&= H(Y_1 Y_2) - H(Y_1 Y_2|X_1 X_2) - H(Z_1 Z_2) + H(Z_1 Z_2|X_1 X_2) \tag{12.50} \\
&= H(Y_1 Y_2) - H(Y_1|X_1 X_2) - H(Y_2|Y_1 X_1 X_2) \notag \\
&\quad - H(Z_1 Z_2) + H(Z_1|X_1 X_2) + H(Z_2|Z_1 X_1 X_2) \tag{12.51} \\
&= H(Y_1 Y_2) - H(Y_1|X_1) - H(Y_2|X_2) \notag \\
&\quad - H(Z_1 Z_2) + H(Z_1|X_1) + H(Z_2|X_2) \tag{12.52} \\
&= H(Y_1) + H(Y_2) - I(Y_1; Y_2) - H(Y_1|X_1) - H(Y_2|X_2) \notag \\
&\quad - H(Z_1) - H(Z_2) + I(Z_1; Z_2) + H(Z_1|X_1) + H(Z_2|X_2) \tag{12.53} \\
&\leq H(Y_1) + H(Y_2) - H(Y_1|X_1) - H(Y_2|X_2) \notag \\
&\quad - H(Z_1) - H(Z_2) + H(Z_1|X_1) + H(Z_2|X_2) \tag{12.54} \\
&= I(X_1; Y_1) - I(X_1; Z_1) + I(X_2; Y_2) - I(X_2; Z_2) \tag{12.55} \\
&\leq P(\mathcal{N}_1) + P(\mathcal{N}_2). \tag{12.56}
\end{align}

The first equality follows from evaluating $P(\mathcal{N}_1 \otimes \mathcal{N}_2)$ on the maximizing distribution $p^*_{X_1 X_2}(x_1, x_2)$.
The second equality follows by expanding the mutual informations. The third equality follows
from the entropy chain rule, and the fourth follows from the conditional independence
relations observed above. The fifth equality follows by replacing the joint entropies with the
sum of the marginal entropies reduced by the mutual information. The important inequality
follows because there is a degrading map from $Y_1$ to $Z_1$ and from $Y_2$ to $Z_2$, implying that
$I(Y_1; Y_2) \geq I(Z_1; Z_2)$ by the data processing inequality. The last equality follows by combining
the entropies into mutual informations, and the final inequality follows because these
information quantities cannot exceed the same information quantities evaluated with respect
to the maximizing distributions. □
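
The crucial step (12.54) rests on the inequality $I(Y_1; Y_2) \geq I(Z_1; Z_2)$. As a sanity check, the sketch below evaluates both sides for one correlated input distribution sent through two degraded wiretap channels. Everything concrete here is an assumption made for illustration: the channels are binary symmetric, the flip probabilities and the joint input distribution are arbitrary, and the helper names are hypothetical.

```python
import numpy as np

def mutual_information_from_joint(joint):
    """I(A;B) in bits from a joint probability table joint[a, b]."""
    p_a = joint.sum(axis=1, keepdims=True)
    p_b = joint.sum(axis=0, keepdims=True)
    prod = p_a * p_b                          # product of the marginals
    mask = joint > 0                          # convention: 0 log 0 = 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / prod[mask])))

def bsc(flip):
    """Binary symmetric channel, rows indexed by input, columns by output."""
    return np.array([[1 - flip, flip], [flip, 1 - flip]])

# Assumed correlated input distribution p(x1, x2).
p_x1x2 = np.array([[0.4, 0.1], [0.1, 0.4]])

# Assumed channels to Bob, p(y_i|x_i), and degrading maps, p(z_i|y_i).
bob1, bob2 = bsc(0.1), bsc(0.15)
deg1, deg2 = bsc(0.2), bsc(0.25)

# p(y1, y2) = sum_{x1,x2} p(x1, x2) p(y1|x1) p(y2|x2)
p_y1y2 = bob1.T @ p_x1x2 @ bob2
# p(z1, z2) = sum_{y1,y2} p(y1, y2) p(z1|y1) p(z2|y2)
p_z1z2 = deg1.T @ p_y1y2 @ deg2

i_x = mutual_information_from_joint(p_x1x2)
i_y = mutual_information_from_joint(p_y1y2)
i_z = mutual_information_from_joint(p_z1z2)
print(i_x, i_y, i_z)
```

The printed values should be non-increasing from left to right, since each stage applies local processing to both variables. A single instance does not prove the inequality, but it illustrates why the correlation terms in (12.53) can be dropped in the direction needed for (12.54).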



An analogous notion of degradability exists in the quantum setting, and Section 12.5
demonstrates that degradable quantum channels have additive coherent information. The 
coherent information of a quantum channel is a measure of how much quantum information 
a sender can transmit through that channel to a receiver and thus is an important quantity 
to consider for quantum data transmission. 

Exercise 12.2.1 Show that the sum of the individual private informations can never be
greater than the private information of the classical tandem channel:

\[ P(\mathcal{N}_1 \otimes \mathcal{N}_2) \geq P(\mathcal{N}_1) + P(\mathcal{N}_2). \tag{12.57} \]

12.3 Holevo Information of a Quantum Channel 

We now turn our attention to the case of dynamic informational measures for quantum
channels, and we begin with a measure of classical correlations. Suppose that Alice would
like to establish classical correlations with Bob, and she wishes to exploit a quantum channel
to do so. Alice can prepare an ensemble $\{p_X(x), \rho_x\}$ in her laboratory, where the states $\rho_x$
are acceptable inputs to the quantum channel. She keeps a copy of the classical index $x$ in
some classical register $X$. The expected density operator of this ensemble is the following
classical-quantum state:

\[ \rho^{XA'} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \rho_x^{A'}. \tag{12.58} \]

Such a preparation is the most general way that Alice can correlate classical data with a
quantum state to input to the channel. Let $\rho^{XB}$ be the state that arises from sending the
$A'$ system through the quantum channel $\mathcal{N}^{A' \to B}$:

\[ \rho^{XB} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \mathcal{N}^{A' \to B}(\rho_x^{A'}). \tag{12.59} \]

We would like to determine a measure of the ability of the quantum channel to preserve
classical correlations. We can appeal to ideas from the classical case in Section 12.1, while
incorporating the static quantum measures from Chapter 11. A good measure of the input-output
classical correlations is the Holevo information of the above classical-quantum state:
$I(X;B)_\rho$. This measure corresponds to a particular preparation that Alice chooses, but
observe that she can prepare the input ensemble in such a way as to achieve the highest
possible correlations. Maximizing the Holevo information over all possible preparations
gives a measure called the Holevo information of the channel.

Definition 12.3.1 (Holevo Information of a Quantum Channel). The Holevo information
of the channel is a measure of the classical correlations that Alice can establish with Bob:

\[ \chi(\mathcal{N}) \equiv \max_{\rho^{XA'}} I(X;B)_\rho, \tag{12.60} \]

where the maximization is over all input ensembles.
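
For a fixed ensemble, the quantity inside the maximization can be evaluated directly, because for a classical-quantum state $I(X;B)_\rho = H\big(\sum_x p_X(x) \mathcal{N}(\rho_x)\big) - \sum_x p_X(x) H(\mathcal{N}(\rho_x))$. The sketch below is an illustration under stated assumptions: the channel is taken to be a qubit depolarizing channel with an assumed parameter `p`, the ensemble is the uniform mixture of the two computational basis states, and the function names are hypothetical. It computes the Holevo information of that one ensemble, which is a lower bound on $\chi(\mathcal{N})$ rather than the maximum itself.

```python
import numpy as np

def entropy(rho):
    """Von Neumann entropy in bits of a density matrix."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]              # convention: 0 log 0 = 0
    return float(-np.sum(evals * np.log2(evals)))

def depolarize(rho, p):
    """Assumed example channel: keep rho with prob. 1 - p, else output I/2."""
    return (1 - p) * rho + p * np.trace(rho) * np.eye(2) / 2

def holevo_information(probs, states, channel):
    """I(X;B) for the ensemble {probs[x], states[x]} sent through `channel`."""
    outputs = [channel(rho) for rho in states]
    average = sum(q * out for q, out in zip(probs, outputs))
    return entropy(average) - sum(q * entropy(out) for q, out in zip(probs, outputs))

ket0 = np.array([[1.0, 0.0], [0.0, 0.0]])     # |0><0|
ket1 = np.array([[0.0, 0.0], [0.0, 1.0]])     # |1><1|
probs = [0.5, 0.5]

chi_noiseless = holevo_information(probs, [ket0, ket1], lambda r: depolarize(r, 0.0))
chi_noisy = holevo_information(probs, [ket0, ket1], lambda r: depolarize(r, 0.5))
chi_useless = holevo_information(probs, [ket0, ket1], lambda r: depolarize(r, 1.0))
print(chi_noiseless, chi_noisy, chi_useless)
```

With no noise the ensemble carries one full bit, and with complete depolarization it carries none. Computing $\chi(\mathcal{N})$ itself would additionally require an optimization over all ensembles, which this sketch does not attempt.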

12.3.1 Additivity of the Holevo information for Specific Channels 

The Holevo information of a quantum channel is generally not additive. The question of
additivity for this case is not whether classical correlations can enhance the Holevo
information, but rather whether quantum correlations can enhance it. That is, Alice can
choose an ensemble of the form $\{p_X(x), \rho_x^{A_1' A_2'}\}$ for input to two uses of the quantum channel.
The conditional density operators $\rho_x^{A_1' A_2'}$ can be entangled, and these quantum correlations
can potentially increase the Holevo information.

The question of additivity of the Holevo information of a quantum channel was a longstanding
open conjecture in quantum information theory; many researchers thought that
quantum correlations would not enhance it and that additivity would hold. But recent
research has demonstrated a counterexample to the additivity conjecture, and perhaps
unsurprisingly in hindsight, this counterexample exploits maximally entangled states to
demonstrate superadditivity (see Section 19.5). Figure 12.4 displays the scenario corresponding to
the question of additivity of the Holevo information.

Figure 12.4: The above figure displays the scenario for determining whether the Holevo information of
two quantum channels $\mathcal{N}_1$ and $\mathcal{N}_2$ is additive. The question of additivity is equivalent to the possibility of
quantum correlations being able to enhance the Holevo information of two quantum channels. The result
proved in Theorem 12.3.1 is that the Holevo information is additive for the tensor product of an
entanglement-breaking channel and any other quantum channel, so that quantum correlations cannot enhance the Holevo
information in this case. This is perhaps intuitive because an entanglement-breaking channel destroys
quantum correlations in the form of quantum entanglement.

Additivity of Holevo information may not hold for all quantum channels, but it is possible
to prove its additivity for certain classes of quantum channels. One such class for which
additivity holds is the class of entanglement-breaking channels, and the proof of additivity
is perhaps the simplest for this case.

Theorem 12.3.1 (Additivity of the Holevo information of an EB Channel). Suppose that a
quantum channel $\mathcal{N}_1$ is entanglement-breaking. Then the Holevo information $\chi(\mathcal{N}_1 \otimes \mathcal{N}_2)$ of
the tensor product channel $\mathcal{N}_1 \otimes \mathcal{N}_2$ is the sum of the individual Holevo informations $\chi(\mathcal{N}_1)$
and $\chi(\mathcal{N}_2)$:

\[ \chi(\mathcal{N}_1 \otimes \mathcal{N}_2) = \chi(\mathcal{N}_1) + \chi(\mathcal{N}_2). \tag{12.61} \]

Proof. The trivial inequality $\chi(\mathcal{N}_1 \otimes \mathcal{N}_2) \geq \chi(\mathcal{N}_1) + \chi(\mathcal{N}_2)$ holds for any two quantum
channels $\mathcal{N}_1$ and $\mathcal{N}_2$ because we can choose the input ensemble on the LHS to be a tensor
product of the ones that individually maximize the terms on the RHS. We now prove the
nontrivial inequality $\chi(\mathcal{N}_1 \otimes \mathcal{N}_2) \leq \chi(\mathcal{N}_1) + \chi(\mathcal{N}_2)$ that holds when $\mathcal{N}_1$ is
entanglement-breaking. Let $\rho^{X B_1 B_2}$ be the state that maximizes the Holevo information $\chi(\mathcal{N}_1 \otimes \mathcal{N}_2)$,
where

\begin{align}
\rho^{X B_1 B_2} &\equiv (\mathcal{N}_1^{A_1' \to B_1} \otimes \mathcal{N}_2^{A_2' \to B_2})(\rho^{X A_1' A_2'}), \tag{12.62} \\
\rho^{X A_1' A_2'} &\equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \rho_x^{A_1' A_2'}. \tag{12.63}
\end{align}

The action of $\mathcal{N}_1$ is to break entanglement. Let $\rho^{X B_1 A_2'}$ be the state after only the
entanglement-breaking channel $\mathcal{N}_1$ acts. We can write this state as follows:

\begin{align}
\rho^{X B_1 A_2'} &\equiv \mathcal{N}_1^{A_1' \to B_1}(\rho^{X A_1' A_2'}) \tag{12.64} \\
&= \sum_x p_X(x) |x\rangle\langle x|^X \otimes \mathcal{N}_1^{A_1' \to B_1}(\rho_x^{A_1' A_2'}) \tag{12.65} \\
&= \sum_x p_X(x) |x\rangle\langle x|^X \otimes \sum_y p_{Y|X}(y|x) \, \sigma_{x,y}^{B_1} \otimes \theta_{x,y}^{A_2'} \tag{12.66} \\
&= \sum_{x,y} p_{Y|X}(y|x) \, p_X(x) |x\rangle\langle x|^X \otimes \sigma_{x,y}^{B_1} \otimes \theta_{x,y}^{A_2'}. \tag{12.67}
\end{align}

The third equality follows because the channel $\mathcal{N}_1$ breaks any entanglement in the state
$\rho_x^{A_1' A_2'}$, leaving behind a separable state $\sum_y p_{Y|X}(y|x) \, \sigma_{x,y}^{B_1} \otimes \theta_{x,y}^{A_2'}$. Then the state $\rho^{X B_1 B_2}$ has
the form

\[ \rho^{X B_1 B_2} = \sum_{x,y} p_{Y|X}(y|x) \, p_X(x) |x\rangle\langle x|^X \otimes \sigma_{x,y}^{B_1} \otimes \mathcal{N}_2^{A_2' \to B_2}(\theta_{x,y}^{A_2'}). \tag{12.68} \]

Let $\omega^{X Y B_1 B_2}$ be an extension of $\rho^{X B_1 B_2}$ where

\[ \omega^{X Y B_1 B_2} \equiv \sum_{x,y} p_{Y|X}(y|x) \, p_X(x) |x\rangle\langle x|^X \otimes |y\rangle\langle y|^Y \otimes \sigma_{x,y}^{B_1} \otimes \mathcal{N}_2^{A_2' \to B_2}(\theta_{x,y}^{A_2'}), \tag{12.69} \]

and $\mathrm{Tr}_Y\{\omega^{X Y B_1 B_2}\} = \rho^{X B_1 B_2}$. Then the following chain of inequalities holds:

\begin{align}
\chi(\mathcal{N}_1 \otimes \mathcal{N}_2) &= I(X; B_1 B_2)_\rho \tag{12.70} \\
&= I(X; B_1 B_2)_\omega \tag{12.71} \\
&\leq I(XY; B_1 B_2)_\omega \tag{12.72} \\
&= H(B_1 B_2)_\omega - H(B_1 B_2|XY)_\omega \tag{12.73} \\
&= H(B_1 B_2)_\omega - H(B_1|XY)_\omega - H(B_2|XY)_\omega \tag{12.74} \\
&\leq H(B_1)_\omega + H(B_2)_\omega - H(B_1|XY)_\omega - H(B_2|XY)_\omega \tag{12.75} \\
&= I(XY; B_1)_\omega + I(XY; B_2)_\omega \tag{12.76} \\
&\leq \chi(\mathcal{N}_1) + \chi(\mathcal{N}_2). \tag{12.77}
\end{align}

The first equality follows because $\rho^{X B_1 B_2}$ is the state that maximizes the Holevo information
$\chi(\mathcal{N}_1 \otimes \mathcal{N}_2)$ of the tensor product channel $\mathcal{N}_1 \otimes \mathcal{N}_2$. The second equality follows because the
reduced state of $\omega^{X Y B_1 B_2}$ on systems $X$, $B_1$, and $B_2$ is equal to $\rho^{X B_1 B_2}$. The first inequality
follows from the quantum data processing inequality. The third equality follows by expanding
the mutual information $I(XY; B_1 B_2)_\omega$. The fourth equality is the crucial one that exploits
the entanglement-breaking property. It follows by examining (12.68) and observing that the
state $\omega^{X Y B_1 B_2}$ on systems $B_1$ and $B_2$ is product when conditioned on classical variables $X$
and $Y$. The second inequality follows from subadditivity of entropy. The last equality follows
from straightforward entropic manipulations, and the final inequality follows because $\omega^{X Y B_1}$
is a particular state of the form needed in the maximization of $\chi(\mathcal{N}_1)$, and the same holds
for the state $\omega^{X Y B_2}$ with respect to $\chi(\mathcal{N}_2)$. □

Corollary 12.3.1. The regularized Holevo information of an entanglement-breaking quantum
channel $\mathcal{N}$ is equal to its Holevo information:

\[ \chi_{\mathrm{reg}}(\mathcal{N}) = \chi(\mathcal{N}). \tag{12.78} \]

Proof. The proof of this property uses the same induction argument as in Corollary 12.1.1
and exploits the additivity property in Theorem 12.3.1 above. □



12.3.2 Optimizing the Holevo Information 
Pure States are Sufficient 



The following theorem allows us to simplify the optimization problem that (12.60) sets out:
we show that it is sufficient to consider ensembles of pure states at the input.

Theorem 12.3.2. The Holevo information is equivalent to a maximization over only pure
states:

\[ \chi(\mathcal{N}) = \max_{\rho^{XA'}} I(X;B) = \max_{\tau^{XA'}} I(X;B), \tag{12.79} \]

where

\[ \tau^{XA'} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes |\phi_x\rangle\langle\phi_x|^{A'}, \tag{12.80} \]

and $\tau^{XB}$ is the state that results from sending the $A'$ system of the above state through the
quantum channel $\mathcal{N}^{A' \to B}$.

Proof. Suppose that $\rho^{XA'}$ is any state of the form in (12.58). Consider a spectral decomposition
of the states $\rho_x^{A'}$:

\[ \rho_x^{A'} = \sum_y p_{Y|X}(y|x) \, \psi_{x,y}^{A'}, \tag{12.81} \]

where the states $\psi_{x,y}^{A'}$ are pure. Then let $\sigma^{XYA'}$ denote the following state:

\[ \sigma^{XYA'} \equiv \sum_{x,y} p_{Y|X}(y|x) \, p_X(x) |x\rangle\langle x|^X \otimes |y\rangle\langle y|^Y \otimes \psi_{x,y}^{A'}, \tag{12.82} \]

so that $\mathrm{Tr}_Y\{\sigma^{XYA'}\} = \rho^{XA'}$. Also, observe that $\sigma^{XYA'}$ is a state of the form $\tau^{XA'}$ with $XY$
as the classical system. Let $\sigma^{XYB}$ denote the state that results from sending the $A'$ system
through the quantum channel $\mathcal{N}^{A' \to B}$. Then the following relations hold:

\begin{align}
I(X;B)_\rho &= I(X;B)_\sigma \tag{12.83} \\
&\leq I(XY;B)_\sigma. \tag{12.84}
\end{align}

The equality follows because $\mathrm{Tr}_Y\{\sigma^{XYB}\} = \rho^{XB}$ and the inequality follows from the quantum
data processing inequality. It then suffices to consider ensembles with only pure states
because the state $\sigma^{XYB}$ is a state of the form $\tau^{XB}$ with the combined system $XY$ acting as
the classical system. □

Concavity in the Distribution and Convexity in the Signal States

We now show that the Holevo information is concave as a function of the input distribution
when the signal states are fixed.

Theorem 12.3.3. The Holevo information $I(X;B)$ is concave in the input distribution when
the signal states are fixed, in the sense that

\[ \lambda I(X;B)_{\sigma_0} + (1 - \lambda) I(X;B)_{\sigma_1} \leq I(X;B)_\sigma, \tag{12.85} \]

where $\sigma_0^{XB}$ and $\sigma_1^{XB}$ are of the form

\begin{align}
\sigma_0^{XB} &\equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \mathcal{N}(\sigma_x), \tag{12.86} \\
\sigma_1^{XB} &\equiv \sum_x q_X(x) |x\rangle\langle x|^X \otimes \mathcal{N}(\sigma_x), \tag{12.87}
\end{align}

and $\sigma^{XB}$ is a mixture of the states $\sigma_0^{XB}$ and $\sigma_1^{XB}$ of the form:

\[ \sigma^{XB} \equiv \sum_x [\lambda p_X(x) + (1 - \lambda) q_X(x)] |x\rangle\langle x|^X \otimes \mathcal{N}(\sigma_x), \tag{12.88} \]

where $0 \leq \lambda \leq 1$.

Proof. Let $\sigma^{XUB}$ be the state

\[ \sigma^{XUB} \equiv \sum_x \left[ p_X(x) |x\rangle\langle x|^X \otimes \lambda |0\rangle\langle 0|^U + q_X(x) |x\rangle\langle x|^X \otimes (1 - \lambda) |1\rangle\langle 1|^U \right] \otimes \mathcal{N}(\sigma_x). \tag{12.89} \]

Observe that $\mathrm{Tr}_U\{\sigma^{XUB}\} = \sigma^{XB}$. Then the statement of concavity is equivalent to

\[ I(X;B|U)_\sigma \leq I(X;B)_\sigma. \tag{12.90} \]

We can rewrite this as

\[ H(B|U)_\sigma - H(B|UX)_\sigma \leq H(B)_\sigma - H(B|X)_\sigma. \tag{12.91} \]

Observe that

\[ H(B|UX)_\sigma = H(B|X)_\sigma, \tag{12.92} \]

i.e., one can calculate that both of these are equal to

\[ \sum_x [\lambda p_X(x) + (1 - \lambda) q_X(x)] H(\mathcal{N}(\sigma_x)). \tag{12.93} \]

The statement of concavity then becomes

\[ H(B|U)_\sigma \leq H(B)_\sigma, \tag{12.94} \]

which follows from concavity of quantum entropy. □

The Holevo information is convex as a function of the signal states when the input
distribution is fixed.

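Theorem 12.3.4. The Holevo information $I(X;B)$ is convex in the signal states when the
input distribution is fixed, in the sense that

\[ \lambda I(X;B)_{\sigma_0} + (1 - \lambda) I(X;B)_{\sigma_1} \geq I(X;B)_\sigma, \tag{12.95} \]

where $\sigma_0^{XB}$ and $\sigma_1^{XB}$ are of the form

\begin{align}
\sigma_0^{XB} &\equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \mathcal{N}(\sigma_x), \tag{12.96} \\
\sigma_1^{XB} &\equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \mathcal{N}(\omega_x), \tag{12.97}
\end{align}

and $\sigma^{XB}$ is a mixture of the states $\sigma_0^{XB}$ and $\sigma_1^{XB}$ of the form:

\[ \sigma^{XB} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \mathcal{N}(\lambda \sigma_x + (1 - \lambda) \omega_x), \tag{12.98} \]

where $0 \leq \lambda \leq 1$.

Proof. Let $\sigma^{XUB}$ be the state

\[ \sigma^{XUB} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \left[ \lambda |0\rangle\langle 0|^U \otimes \mathcal{N}(\sigma_x) + (1 - \lambda) |1\rangle\langle 1|^U \otimes \mathcal{N}(\omega_x) \right]. \tag{12.99} \]

Observe that $\mathrm{Tr}_U\{\sigma^{XUB}\} = \sigma^{XB}$. Then convexity in the input states is equivalent to the
statement

\[ I(X;B|U)_\sigma \geq I(X;B)_\sigma. \tag{12.100} \]

Consider that

\[ I(X;B|U)_\sigma = I(X;BU)_\sigma - I(X;U)_\sigma, \tag{12.101} \]

by the chain rule for the quantum mutual information. Since the input distribution $p_X(x)$ is
fixed, there are no correlations between $X$ and the convexity variable $U$, so that $I(X;U)_\sigma = 0$.
Thus, the above inequality is equivalent to

\[ I(X;BU)_\sigma \geq I(X;B)_\sigma, \tag{12.102} \]

which follows from the quantum data processing inequality. □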



In the above two theorems, we have shown that the Holevo information is concave in the
input distribution when the signal states are fixed, and convex in the signal states when the
input distribution is fixed. Thus, the computation of the Holevo information of a general
quantum channel becomes difficult as the input dimension of the channel grows larger, since
a local maximum of the Holevo information is not necessarily a global maximum. However, if
the channel has a classical input and a quantum output, the computation of the Holevo
information is straightforward because the only input parameter is the input distribution, and
we proved that the Holevo information is a concave function of the input distribution.

12.4 Mutual Information of a Quantum Channel

We now consider a measure of the ability of a quantum channel to preserve quantum
correlations. The way that we arrive at this measure is similar to what we have seen before.
Alice prepares some pure quantum state $\phi^{AA'}$ in her laboratory, and inputs the $A'$ system to
a quantum channel $\mathcal{N}^{A' \to B}$. This transmission gives rise to the following noisy state:

\[ \rho^{AB} \equiv \mathcal{N}^{A' \to B}(\phi^{AA'}). \tag{12.103} \]

The quantum mutual information $I(A;B)_\rho$ is a static measure of quantum correlations
present in the state $\rho^{AB}$. To maximize the quantum correlations that the quantum channel
can establish, Alice should maximize the quantum mutual information $I(A;B)_\rho$ over all
possible pure states that she can input to the channel $\mathcal{N}^{A' \to B}$. This procedure leads to the
definition of the mutual information $I(\mathcal{N})$ of a quantum channel:

\[ I(\mathcal{N}) \equiv \max_{\phi^{AA'}} I(A;B)_\rho. \tag{12.104} \]

The mutual information of a quantum channel corresponds to an important operational
task that is not particularly obvious from the above discussion. Suppose that Alice and
Bob share unlimited bipartite entanglement in whatever form they wish, and suppose they
have access to a large number of independent uses of the channel $\mathcal{N}^{A' \to B}$. Then the mutual
information of the channel corresponds to the maximal amount of classical information that
they can transmit in such a setting. This setting is the noisy analog of the super-dense coding
protocol from Chapter 6 (recall the discussion in Section 6.4). By teleportation, the maximal
amount of quantum information that they can transmit is half of the mutual information of
the channel. We discuss how to prove these statements rigorously in Chapter 20.
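
To make the quantity inside the maximization in (12.104) concrete, the sketch below evaluates $I(A;B)_\rho = H(A)_\rho + H(B)_\rho - H(AB)_\rho$ for one particular input. It is only an illustration under stated assumptions: the channel is taken to be a qubit dephasing channel described by Kraus operators $\sqrt{1-p}\,I$ and $\sqrt{p}\,Z$, the input is the maximally entangled state, and all helper names are hypothetical. The value obtained for a single input is a lower bound on $I(\mathcal{N})$; the sketch does not carry out the maximization.

```python
import numpy as np

def entropy(rho):
    """Von Neumann entropy in bits of a density matrix."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]              # convention: 0 log 0 = 0
    return float(-np.sum(evals * np.log2(evals)))

def partial_trace(rho_ab, keep):
    """Reduce a two-qubit density matrix to qubit A (keep=0) or B (keep=1)."""
    r = rho_ab.reshape(2, 2, 2, 2)            # indices: a, b, a', b'
    if keep == 0:
        return np.einsum('abcb->ac', r)       # trace out B
    return np.einsum('abac->bc', r)           # trace out A

def apply_to_b(kraus_ops, rho_ab):
    """Apply the channel with the given Kraus operators to the second qubit."""
    out = np.zeros_like(rho_ab)
    for k in kraus_ops:
        full = np.kron(np.eye(2), k)          # identity on A, Kraus operator on A'
        out = out + full @ rho_ab @ full.conj().T
    return out

def dephasing_kraus(p):
    """Assumed example channel: apply Pauli Z with probability p."""
    z = np.diag([1.0, -1.0])
    return [np.sqrt(1 - p) * np.eye(2), np.sqrt(p) * z]

def mutual_information(rho_ab):
    """Quantum mutual information I(A;B) = H(A) + H(B) - H(AB)."""
    return (entropy(partial_trace(rho_ab, 0)) + entropy(partial_trace(rho_ab, 1))
            - entropy(rho_ab))

# Maximally entangled input state (|00> + |11>)/sqrt(2) on systems A and A'.
phi = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
phi_aa = np.outer(phi, phi)

i_ideal = mutual_information(apply_to_b(dephasing_kraus(0.0), phi_aa))
i_half = mutual_information(apply_to_b(dephasing_kraus(0.5), phi_aa))
print(i_ideal, i_half)
```

For the noiseless case the mutual information of this state is two bits, matching the super-dense coding intuition, while complete dephasing ($p = 1/2$) leaves one bit of purely classical correlation.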



12.4.1 Additivity 

There might be little reason to expect that the quantum mutual information of a quantum
channel is additive, given that the Holevo information is not. But perhaps surprisingly,
additivity does hold for the mutual information of a quantum channel! This result means
that we completely understand this measure of information throughput, and it also means
that we understand the operational task to which it corresponds (entanglement-assisted
classical coding discussed in the previous section).

We might intuitively attempt to explain this phenomenon in terms of this operational
task: Alice and Bob already share unlimited entanglement between their terminals, and so
entangled correlations at the input of the channel do not lead to any superadditive effect
as they do for the Holevo information. This explanation is somewhat rough, but perhaps
the additivity proof explains best why additivity holds. The crucial inequality in the proof
follows from three applications of the strong subadditivity inequality (Theorem 11.9.1) and
one application of subadditivity (Corollary 11.8.1). This highlights the importance of strong
subadditivity in quantum Shannon theory. Figure 12.5 illustrates the setting corresponding
to the analysis for additivity of the mutual information of a quantum channel.

Figure 12.5: The above figure displays the scenario for determining whether the mutual information of
two quantum channels $\mathcal{N}_1$ and $\mathcal{N}_2$ is additive. The question of additivity is equivalent to the possibility of
quantum correlations between channel inputs being able to enhance the mutual information of two quantum
channels. The result proved in Theorem 12.4.1 is that the mutual information is additive for any two quantum
channels, so that quantum correlations cannot enhance it.



Theorem 12.4.1 (Additivity of Quantum Mutual Information of Quantum Channels). Let
$\mathcal{N}_1$ and $\mathcal{N}_2$ be any quantum channels. Then the mutual information of the tensor product
channel $\mathcal{N}_1 \otimes \mathcal{N}_2$ is the sum of their individual mutual informations:

\[ I(\mathcal{N}_1 \otimes \mathcal{N}_2) = I(\mathcal{N}_1) + I(\mathcal{N}_2). \tag{12.105} \]

Proof. We first prove the trivial inequality $I(\mathcal{N}_1 \otimes \mathcal{N}_2) \geq I(\mathcal{N}_1) + I(\mathcal{N}_2)$. Let $\phi^{A_1 A_1'}$ and
$\psi^{A_2 A_2'}$ be the states that maximize the respective mutual informations $I(\mathcal{N}_1)$ and $I(\mathcal{N}_2)$. Let
$U_{\mathcal{N}_1}^{A_1' \to B_1 E_1}$ and $U_{\mathcal{N}_2}^{A_2' \to B_2 E_2}$ denote the respective isometric extensions of $\mathcal{N}_1$ and $\mathcal{N}_2$. The states
$\phi$ and $\psi$ then lead to a state $\varphi$ where

\[ \varphi^{A_1 A_2 B_1 B_2 E_1 E_2} \equiv (U_{\mathcal{N}_1}^{A_1' \to B_1 E_1} \otimes U_{\mathcal{N}_2}^{A_2' \to B_2 E_2})(\phi^{A_1 A_1'} \otimes \psi^{A_2 A_2'}). \tag{12.106} \]

Observe that the state $\mathrm{Tr}_{E_1 E_2}\{\varphi\}$ is a particular state of the form required in the maximization
of $I(\mathcal{N}_1 \otimes \mathcal{N}_2)$, by taking $A \equiv A_1 A_2$. Then the following inequalities hold:

\begin{align}
I(\mathcal{N}_1) + I(\mathcal{N}_2) &= I(A_1; B_1)_{\mathcal{N}_1(\phi)} + I(A_2; B_2)_{\mathcal{N}_2(\psi)} \tag{12.107} \\
&= H(B_1)_{\mathcal{N}_1(\phi)} - H(B_1|A_1)_{\mathcal{N}_1(\phi)} \notag \\
&\quad + H(B_2)_{\mathcal{N}_2(\psi)} - H(B_2|A_2)_{\mathcal{N}_2(\psi)} \tag{12.108} \\
&= H(B_1 B_2)_\varphi - H(B_1|A_1 A_2)_\varphi - H(B_2|A_2 A_1 B_1)_\varphi \tag{12.109} \\
&= H(B_1 B_2)_\varphi - H(B_1 B_2|A_1 A_2)_\varphi \tag{12.110} \\
&= I(A_1 A_2; B_1 B_2)_\varphi \tag{12.111} \\
&\leq I(\mathcal{N}_1 \otimes \mathcal{N}_2). \tag{12.112}
\end{align}

The first equality follows by evaluating the mutual informations $I(\mathcal{N}_1)$ and $I(\mathcal{N}_2)$ with
respect to the maximizing states $\phi^{A_1 A_1'}$ and $\psi^{A_2 A_2'}$. The second equality follows from the
expansion of mutual information in (11.73). The third equality follows because $H(B_1) + H(B_2) = H(B_1 B_2)$
when the quantum state $\varphi$ on systems $B_1$ and $B_2$ is in a product state,
$H(B_1|A_1 A_2) = H(B_1|A_1)$ when the state $\varphi$ on $B_1$ and $A_1$ is product with respect to $A_2$ (see
Exercise 11.4.1), and $H(B_2|A_2 A_1 B_1) = H(B_2|A_2)$ when the state $\varphi$ on $B_2$ and $A_2$ is product
with respect to $A_1$ and $B_1$ (again see Exercise 11.4.1). The fourth equality follows from the
entropy chain rule. The fifth equality follows again from the expansion of mutual information.
The final inequality follows because the input state $\phi^{A_1 A_1'} \otimes \psi^{A_2 A_2'}$ is a particular input state of
the more general form $\phi^{A A_1' A_2'}$ needed in the maximization of the quantum mutual information
of the tensor product channel $\mathcal{N}_1 \otimes \mathcal{N}_2$. Notice that the steps here are almost identical
to the ones in the classical proof of Theorem 12.1.1, with the exception that we use the
quantum generalization of the classical properties.

We now prove the non-trivial inequality $I(\mathcal{N}_1 \otimes \mathcal{N}_2) \leq I(\mathcal{N}_1) + I(\mathcal{N}_2)$. Let $\phi^{A A_1' A_2'}$ be the
state that maximizes the mutual information $I(\mathcal{N}_1 \otimes \mathcal{N}_2)$ and let

\begin{align}
\sigma^{A B_1 E_1 A_2'} &\equiv U_{\mathcal{N}_1}^{A_1' \to B_1 E_1}(\phi^{A A_1' A_2'}), \tag{12.113} \\
\theta^{A A_1' B_2 E_2} &\equiv U_{\mathcal{N}_2}^{A_2' \to B_2 E_2}(\phi^{A A_1' A_2'}), \tag{12.114} \\
\varphi^{A B_1 E_1 B_2 E_2} &\equiv (U_{\mathcal{N}_1}^{A_1' \to B_1 E_1} \otimes U_{\mathcal{N}_2}^{A_2' \to B_2 E_2})(\phi^{A A_1' A_2'}). \tag{12.115}
\end{align}

Consider the following chain of inequalities:

\begin{align}
I(\mathcal{N}_1 \otimes \mathcal{N}_2) &= I(A; B_1 B_2)_\varphi \tag{12.116} \\
&= H(A)_\varphi + H(B_1 B_2)_\varphi - H(A B_1 B_2)_\varphi \tag{12.117} \\
&= H(B_1 B_2 E_1 E_2)_\varphi + H(B_1 B_2)_\varphi - H(E_1 E_2)_\varphi \tag{12.118} \\
&= H(B_1 B_2|E_1 E_2)_\varphi + H(B_1 B_2)_\varphi \tag{12.119} \\
&\leq H(B_1|E_1)_\varphi + H(B_2|E_2)_\varphi + H(B_1)_\varphi + H(B_2)_\varphi \tag{12.120} \\
&= H(B_1 E_1)_\varphi + H(B_1)_\varphi - H(E_1)_\varphi \notag \\
&\quad + H(B_2 E_2)_\varphi + H(B_2)_\varphi - H(E_2)_\varphi \tag{12.121} \\
&= H(A A_2')_\sigma + H(B_1)_\sigma - H(A A_2' B_1)_\sigma \notag \\
&\quad + H(A A_1')_\theta + H(B_2)_\theta - H(A A_1' B_2)_\theta \tag{12.122} \\
&= I(A A_2'; B_1)_\sigma + I(A A_1'; B_2)_\theta \tag{12.123} \\
&\leq I(\mathcal{N}_1) + I(\mathcal{N}_2). \tag{12.124}
\end{align}

The first equality follows from the definition of $I(\mathcal{N}_1 \otimes \mathcal{N}_2)$ in (12.104) and evaluating
$I(A; B_1 B_2)$ with respect to the state $\varphi$, which arises from the maximizing input state $\phi$.
The second equality follows by expanding the quantum mutual information. The third equality
follows because the state $\varphi$ on systems $A$, $B_1$, $B_2$, $E_1$, and $E_2$ is pure. The fourth equality
follows from the definition of conditional quantum entropy in (11.4.1). The first inequality is
the crucial one that leads to additivity. It follows from three applications of strong
subadditivity (Theorem 11.7.1) to obtain $H(B_1 B_2|E_1 E_2) \leq H(B_1|E_1) + H(B_2|E_2)$ and subadditivity
(Theorem 11.6.1) to obtain $H(B_1 B_2) \leq H(B_1) + H(B_2)$. The fifth equality follows by expanding
the conditional quantum entropies $H(B_1|E_1)$ and $H(B_2|E_2)$. The sixth equality follows because
the state $\sigma$ on systems $A$, $A_2'$, $B_1$, and $E_1$ is pure, and the state $\theta$ on systems $A$, $A_1'$, $B_2$, and
$E_2$ is pure. The last equality follows from the definition of quantum mutual information, and
the final inequality follows because the states $\sigma$ and $\theta$ are particular states of the form needed
in the respective maximizations of $I(\mathcal{N}_1)$ and $I(\mathcal{N}_2)$. □

Corollary 12.4.1. The regularized mutual information of any quantum channel is equal to
its mutual information:

\[ I_{\mathrm{reg}}(\mathcal{N}) = I(\mathcal{N}). \tag{12.125} \]

Proof. The proof of this property uses the same induction argument as in Corollary 12.1.1
and exploits the additivity property in Theorem 12.4.1 above. □

Exercise 12.4.1 (Alternate Mutual Information of a Quantum Channel) Let $\rho^{XAB}$
denote a state of the following form:

\[ \rho^{XAB} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \mathcal{N}^{A' \to B}(\phi_x^{AA'}). \tag{12.126} \]

Consider the following alternate definition of the mutual information of a quantum channel:

\[ I_{\mathrm{alt}}(\mathcal{N}) \equiv \max_{\rho^{XAB}} I(AX; B), \tag{12.127} \]

where the maximization is over states of the form $\rho^{XAB}$. Show that

\[ I_{\mathrm{alt}}(\mathcal{N}) = I(\mathcal{N}). \tag{12.128} \]

Exercise 12.4.2 Compute the mutual information of a dephasing channel with dephasing
parameter $p$.

Exercise 12.4.3 Compute the mutual information of an erasure channel with erasure parameter $\epsilon$.

Exercise 12.4.4 (Pure States are Sufficient) Show that it is sufficient to consider pure
states $\phi^{AA'}$ for determining the mutual information of a quantum channel. That is, one does
not need to consider mixed states $\rho^{AA'}$ in the optimization task. (Hint: use the spectral
decomposition, the quantum data processing inequality, and apply the result of Exercise 12.4.1.)



12.4.2 Optimizing the Mutual Information of a Quantum Channel 

We now show that the mutual information of a quantum channel is concave as a function
of the input state. This result allows us to compute this quantity with standard convex
optimization techniques.

Theorem 12.4.2. The mutual information $I(A;B)$ is concave in the input state, in the
sense that

\[ \sum_x p_X(x) I(A;B)_{\rho_x} \leq I(A;B)_\sigma, \tag{12.129} \]

where $\rho_x^{AB} \equiv \mathcal{N}^{A' \to B}(\phi_x^{AA'})$, $\sigma^{A'} \equiv \sum_x p_X(x) \rho_x^{A'}$, $\phi^{AA'}$ is a purification of $\sigma^{A'}$, and
$\sigma^{AB} \equiv \mathcal{N}^{A' \to B}(\phi^{AA'})$.

Proof. Let $\rho^{XABE}$ be the following classical-quantum state:

\[ \rho^{XABE} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes U_{\mathcal{N}}^{A' \to BE}(\phi_x^{AA'}), \tag{12.130} \]

where $U_{\mathcal{N}}^{A' \to BE}$ is the isometric extension of the channel. Consider the following chain of
inequalities:

\begin{align}
\sum_x p_X(x) I(A;B)_{\rho_x} &= I(A;B|X)_\rho \tag{12.131} \\
&= H(A|X)_\rho + H(B|X)_\rho - H(AB|X)_\rho \tag{12.132} \\
&= H(BE|X)_\rho + H(B|X)_\rho - H(E|X)_\rho \tag{12.133} \\
&= H(B|EX)_\rho + H(B|X)_\rho \tag{12.134} \\
&\leq H(B|E)_\rho + H(B)_\rho \tag{12.135} \\
&= H(B|E)_\sigma + H(B)_\sigma \tag{12.136} \\
&= I(A;B)_\sigma. \tag{12.137}
\end{align}

The first equality follows because the conditioning system $X$ in $I(A;B|X)$ is classical. The
second equality follows by expanding the quantum mutual information. The third equality
follows because the state on $ABE$ is pure when conditioned on $X$. The fourth equality
follows from the definition of conditional quantum entropy. The inequality follows from
strong subadditivity and concavity of quantum entropy. The next equality follows by inspecting
the definition of the state $\sigma$, and the final equality follows because the state is pure on
systems $ABE$. □

12.5 Coherent Information of a Quantum Channel 

This section presents an alternative, important measure of the ability of a quantum channel
to preserve quantum correlations: the coherent information of the channel. The way we
arrive at this measure is similar to how we did for the mutual information of a quantum
channel. Alice prepares a pure state $\phi^{AA'}$ and inputs the $A'$ system to a quantum channel
$\mathcal{N}^{A' \to B}$. This transmission leads to a noisy state $\rho^{AB}$ where

\[ \rho^{AB} \equiv \mathcal{N}^{A' \to B}(\phi^{AA'}). \tag{12.138} \]

The coherent information of the state that arises from the channel is as follows:

\[ I(A \rangle B)_\rho \equiv H(B)_\rho - H(AB)_\rho, \tag{12.139} \]

leading to our next definition.

Definition 12.5.1 (Coherent Information of a Quantum Channel). The coherent information
$Q(\mathcal{N})$ of a quantum channel is the maximum of the coherent information over all input
states:

\[ Q(\mathcal{N}) \equiv \max_{\phi^{AA'}} I(A \rangle B)_\rho. \tag{12.140} \]

The coherent information of a quantum channel corresponds to an important operational
task (perhaps the most important for quantum information). It is a good lower bound on
the ultimate rate at which Alice can transmit quantum information to Bob, and in some
special cases it is actually equal to the quantum communication capacity of a quantum channel.
We prove these results rigorously in Chapter 23.

Exercise 12.5.1 Let $I_c(\rho, \mathcal{N})$ denote the coherent information of a channel $\mathcal{N}$ when state
$\rho$ is its input:

\[ I_c(\rho, \mathcal{N}) \equiv H(\mathcal{N}(\rho)) - H(\mathcal{N}^c(\rho)), \tag{12.141} \]

where $\mathcal{N}^c$ is a channel complementary to the original channel $\mathcal{N}$. Show that

\[ Q(\mathcal{N}) = \max_\rho I_c(\rho, \mathcal{N}). \tag{12.142} \]

An equivalent way of writing the above expression on the RHS is

\[ \max_{\phi^{AA'}} \left[ H(B)_\psi - H(E)_\psi \right], \tag{12.143} \]

where $|\psi\rangle^{ABE} \equiv U_{\mathcal{N}}^{A' \to BE} |\phi\rangle^{AA'}$ and $U_{\mathcal{N}}^{A' \to BE}$ is the isometric extension of the channel $\mathcal{N}$.

The following property points out that the coherent information of a channel is always
non-negative, even though the coherent information of any given state can sometimes be negative.

Property 12.5.1 (Non-negativity of Channel Coherent Information) The coherent
information $Q(\mathcal{N})$ of a quantum channel $\mathcal{N}$ is non-negative:

\[ Q(\mathcal{N}) \geq 0. \tag{12.144} \]

Proof. We can choose the input state $\phi^{AA'}$ to be a product state of the form $\psi^A \otimes \phi^{A'}$. The
coherent information of this state vanishes:

\begin{align}
I(A \rangle B)_{\psi^A \otimes \mathcal{N}(\phi^{A'})} &= H(B)_{\mathcal{N}(\phi^{A'})} - H(AB)_{\psi^A \otimes \mathcal{N}(\phi^{A'})} \tag{12.145} \\
&= H(B)_{\mathcal{N}(\phi^{A'})} - H(A)_{\psi^A} - H(B)_{\mathcal{N}(\phi^{A'})} \tag{12.146} \\
&= 0. \tag{12.147}
\end{align}

The first equality follows by evaluating the coherent information for the product state. The
second equality follows because the state on $AB$ is product. The last equality follows because
the state on $A$ is pure. The above property then holds because the coherent information of
a channel can only be greater than or equal to this amount, given that it involves a
maximization over all input states and the above state is a particular input state. □
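
The vanishing of the coherent information on a product input can be confirmed numerically. The following is a minimal sketch under stated assumptions: the channel is an illustrative qubit dephasing channel with Kraus operators $\sqrt{1-p}\,I$ and $\sqrt{p}\,Z$ and an assumed parameter $p = 0.3$, the product input is $|0\rangle \otimes |+\rangle$, and the helper names are hypothetical. For comparison it also evaluates a maximally entangled input, which for this example gives a strictly positive value.

```python
import numpy as np

def entropy(rho):
    """Von Neumann entropy in bits of a density matrix."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]              # convention: 0 log 0 = 0
    return float(-np.sum(evals * np.log2(evals)))

def reduce_to_b(rho_ab):
    """Trace out qubit A from a two-qubit density matrix."""
    r = rho_ab.reshape(2, 2, 2, 2)            # indices: a, b, a', b'
    return np.einsum('abac->bc', r)

def apply_to_b(kraus_ops, rho_ab):
    """Apply the channel with the given Kraus operators to the second qubit."""
    out = np.zeros_like(rho_ab)
    for k in kraus_ops:
        full = np.kron(np.eye(2), k)
        out = out + full @ rho_ab @ full.conj().T
    return out

def dephasing_kraus(p):
    """Assumed example channel: apply Pauli Z with probability p."""
    z = np.diag([1.0, -1.0])
    return [np.sqrt(1 - p) * np.eye(2), np.sqrt(p) * z]

def coherent_information(rho_ab):
    """I(A>B) = H(B) - H(AB) for a bipartite state."""
    return entropy(reduce_to_b(rho_ab)) - entropy(rho_ab)

kraus = dephasing_kraus(0.3)

# Product input |0> (system A) tensor |+> (system A').
ket0 = np.array([1.0, 0.0])
plus = np.array([1.0, 1.0]) / np.sqrt(2)
product_vec = np.kron(ket0, plus)
product_in = np.outer(product_vec, product_vec)

# Maximally entangled input (|00> + |11>)/sqrt(2).
bell_vec = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
bell_in = np.outer(bell_vec, bell_vec)

q_product = coherent_information(apply_to_b(kraus, product_in))
q_entangled = coherent_information(apply_to_b(kraus, bell_in))
print(q_product, q_entangled)
```

Either value is a lower bound on $Q(\mathcal{N})$ for this example channel, and the product input reproduces the universal lower bound of zero used in the proof.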

12.5.1 Additivity of Coherent Information for Degradable Channels

The coherent information of a quantum channel is generally not additive for arbitrary
quantum channels. You might potentially view this situation as unfortunate, but it implies that
quantum Shannon theory is a richer theory than its classical counterpart. Attempts to
understand why and how this quantity is not additive have led to many breakthroughs (see
Section 23.7).

Degradable quantum channels form a special class of channels for which the coherent
information is additive. These channels have a property that is analogous to a property of
the degraded wiretap channels from Section 12.2. To understand this property, recall that any
quantum channel $\mathcal{N}^{A' \to B}$ has a complementary channel $(\mathcal{N}^c)^{A' \to E}$, realized by considering
the isometric extension of the channel and tracing over Bob's system.

Definition 12.5.2 (Degradable Quantum Channel). A degradable quantum channel is one
for which there exists a degrading map $T^{B \to E}$ so that for any input state $\rho^{A'}$:

\[ (\mathcal{N}^c)^{A' \to E}(\rho^{A'}) = T^{B \to E}(\mathcal{N}^{A' \to B}(\rho^{A'})). \tag{12.148} \]

The intuition behind a degradable quantum channel is that the map from Alice to Eve
is noisier than the map from Alice to Bob, in the sense that Bob can simulate the map to
Eve by applying a noisy map to his system. We can show additivity of coherent information
for these channels by exploiting a technique similar to that in the proof of additivity for
degraded wiretap channels. The picture to consider for the analysis of additivity is the same
as in Figure 12.5.



Theorem 12.5.1 (Additivity of the Coherent Information of Degradable Quantum Channels). Let $\mathcal{N}_1$ and $\mathcal{N}_2$ be any quantum channels that are degradable. Then the coherent information of the tensor product channel $\mathcal{N}_1 \otimes \mathcal{N}_2$ is the sum of their individual coherent informations:

\[ Q(\mathcal{N}_1 \otimes \mathcal{N}_2) = Q(\mathcal{N}_1) + Q(\mathcal{N}_2). \tag{12.149} \]



Proof. We leave the proof of the inequality $Q(\mathcal{N}_1 \otimes \mathcal{N}_2) \geq Q(\mathcal{N}_1) + Q(\mathcal{N}_2)$ as Exercise 12.5.3 below, and we prove the non-trivial inequality $Q(\mathcal{N}_1 \otimes \mathcal{N}_2) \leq Q(\mathcal{N}_1) + Q(\mathcal{N}_2)$ that holds when quantum channels $\mathcal{N}_1$ and $\mathcal{N}_2$ are degradable. Consider a pure state $\phi^{AA_1'A_2'}$ that serves as the input to the two quantum channels. Let $U_{\mathcal{N}_1}^{A_1' \to B_1E_1}$ denote the isometric extension of the first channel and let $U_{\mathcal{N}_2}^{A_2' \to B_2E_2}$ denote the isometric extension of the second channel. Let

\begin{align*}
\sigma^{AB_1E_1A_2'} &\equiv U_{\mathcal{N}_1} \phi\, U_{\mathcal{N}_1}^\dagger, \tag{12.150} \\
\theta^{AA_1'B_2E_2} &\equiv U_{\mathcal{N}_2} \phi\, U_{\mathcal{N}_2}^\dagger, \tag{12.151} \\
\rho^{AB_1E_1B_2E_2} &\equiv (U_{\mathcal{N}_1} \otimes U_{\mathcal{N}_2})\, \phi\, (U_{\mathcal{N}_1}^\dagger \otimes U_{\mathcal{N}_2}^\dagger). \tag{12.152}
\end{align*}

We need to show that $Q(\mathcal{N}_1 \otimes \mathcal{N}_2) = Q(\mathcal{N}_1) + Q(\mathcal{N}_2)$ when both channels are degradable. Furthermore, let $\rho^{AB_1E_1B_2E_2}$ be the state that maximizes $Q(\mathcal{N}_1 \otimes \mathcal{N}_2)$. Consider the following chain of inequalities:

\begin{align*}
Q(\mathcal{N}_1 \otimes \mathcal{N}_2) &= I(A \rangle B_1B_2)_\rho \tag{12.153} \\
&= H(B_1B_2)_\rho - H(AB_1B_2)_\rho \tag{12.154} \\
&= H(B_1B_2)_\rho - H(E_1E_2)_\rho \tag{12.155} \\
&= H(B_1)_\rho - H(E_1)_\rho + H(B_2)_\rho - H(E_2)_\rho - [I(B_1;B_2)_\rho - I(E_1;E_2)_\rho] \tag{12.156} \\
&\leq H(B_1)_\rho - H(E_1)_\rho + H(B_2)_\rho - H(E_2)_\rho \tag{12.157} \\
&= H(B_1)_\sigma - H(AA_2'B_1)_\sigma + H(B_2)_\theta - H(AA_1'B_2)_\theta \tag{12.158} \\
&= I(AA_2' \rangle B_1)_\sigma + I(AA_1' \rangle B_2)_\theta \tag{12.159} \\
&\leq Q(\mathcal{N}_1) + Q(\mathcal{N}_2). \tag{12.160}
\end{align*}

The first equality follows from the definition of $Q(\mathcal{N}_1 \otimes \mathcal{N}_2)$ and because we set $\rho$ to be the state that maximizes the tensor product channel coherent information. The second equality follows from the definition of coherent information, and the third equality follows because the state $\rho$ is pure on systems $AB_1E_1B_2E_2$. The fourth equality follows by expanding the entropies in the previous line. The first inequality (the crucial one) follows because there is a degrading channel from both $B_1$ to $E_1$ and $B_2$ to $E_2$, allowing us to apply the quantum data processing inequality twice to get $I(B_1;B_2)_\rho \geq I(E_1;E_2)_\rho$. The fifth equality follows because the entropies of $\rho$, $\sigma$, and $\theta$ on the given reduced systems are equal and because the state $\sigma$ on systems $AA_2'B_1E_1$ is pure and the state $\theta$ on systems $AA_1'B_2E_2$ is pure. The last equality follows from the definition of coherent information, and the final inequality follows because the coherent informations are less than their respective maximizations over all possible states. □

Corollary 12.5.1. The regularized coherent information of a degradable quantum channel is equal to its coherent information:

\[ Q_{\text{reg}}(\mathcal{N}) = Q(\mathcal{N}). \tag{12.161} \]

Proof. The proof of this property uses the same induction argument as in Corollary 12.1.1 and exploits the additivity property in Theorem 12.5.1 above. □
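The induction step can be written out explicitly. The following is a sketch under two assumptions: that the regularized coherent information is defined as the limit $\lim_{n \to \infty} \frac{1}{n} Q(\mathcal{N}^{\otimes n})$, in analogy with the regularized private information, and that a tensor power of a degradable channel is itself degradable, so that Theorem 12.5.1 applies at every step.

```latex
\begin{align*}
Q(\mathcal{N}^{\otimes n})
  &= Q(\mathcal{N}^{\otimes (n-1)} \otimes \mathcal{N}) \\
  &= Q(\mathcal{N}^{\otimes (n-1)}) + Q(\mathcal{N})
     && \text{(Theorem 12.5.1)} \\
  &= \cdots = n\, Q(\mathcal{N}), \\
Q_{\text{reg}}(\mathcal{N})
  &= \lim_{n \to \infty} \frac{1}{n}\, Q(\mathcal{N}^{\otimes n})
   = \lim_{n \to \infty} \frac{1}{n}\, n\, Q(\mathcal{N})
   = Q(\mathcal{N}).
\end{align*}
```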



Exercise 12.5.2 Consider the quantum erasure channel where the erasure parameter $\epsilon$ is such that $0 \leq \epsilon \leq 1/2$. Find the channel that degrades this one and compute the coherent information of the erasure channel as a function of $\epsilon$.

Exercise 12.5.3 (Superadditivity of Coherent Information) Show that the coherent information of the tensor product channel $\mathcal{N}_1 \otimes \mathcal{N}_2$ is never less than the sum of their individual coherent informations:

\[ Q(\mathcal{N}_1 \otimes \mathcal{N}_2) \geq Q(\mathcal{N}_1) + Q(\mathcal{N}_2). \tag{12.162} \]

Exercise 12.5.4 Prove using monotonicity of relative entropy that the coherent information is subadditive for degradable channels $\mathcal{N}_1$ and $\mathcal{N}_2$:

\[ Q(\mathcal{N}_1 \otimes \mathcal{N}_2) \leq Q(\mathcal{N}_1) + Q(\mathcal{N}_2). \tag{12.163} \]

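Exercise 12.5.5 Consider a quantity known as the reverse coherent information:

\[ Q_{\text{rev}}(\mathcal{N}) \equiv \max_{\phi^{AA'}} I(B \rangle A)_{\mathcal{N}(\phi^{AA'})}. \tag{12.164} \]

Show that the reverse coherent information is additive for any two quantum channels $\mathcal{N}_1$ and $\mathcal{N}_2$:

\[ Q_{\text{rev}}(\mathcal{N}_1 \otimes \mathcal{N}_2) = Q_{\text{rev}}(\mathcal{N}_1) + Q_{\text{rev}}(\mathcal{N}_2). \tag{12.165} \]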

12.5.2 Optimizing the Coherent Information of a Degradable Channel

We would like to determine how difficult it is to maximize the coherent information of a quantum channel. For general channels, this problem is difficult, but it turns out to be straightforward for the class of degradable quantum channels. Theorem 12.5.2 below states an important property of the coherent information $Q(\mathcal{N})$ of a degradable quantum channel $\mathcal{N}$ that allows us to answer this question. The theorem states that the coherent information of a degradable quantum channel is a concave function of the input density operator $\rho^{A'}$ over which we maximize it. In particular, this result implies that every local maximum of the coherent information is a global maximum, since the set of density operators is convex, and the optimization problem is therefore a straightforward computation that can exploit convex optimization methods. The theorem below exploits the characterization of the channel coherent information from Exercise 12.5.1.

Theorem 12.5.2. Suppose that a quantum channel $\mathcal{N}$ is degradable. Then the coherent information $I_c(\rho, \mathcal{N})$ is concave in the input density operator:

\[ \sum_x p_X(x)\, I_c(\rho_x, \mathcal{N}) \leq I_c\Big(\sum_x p_X(x) \rho_x, \mathcal{N}\Big), \tag{12.166} \]

where $p_X(x)$ is a probability density function and each $\rho_x$ is a density operator.

Proof. Consider the following states:

\begin{align*}
\sigma^{XB} &\equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \mathcal{N}(\rho_x), \tag{12.167} \\
\theta^{XE} &\equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes (\mathcal{T} \circ \mathcal{N})(\rho_x), \tag{12.168}
\end{align*}

where $\mathcal{T}$ is a degrading map of the channel $\mathcal{N}$ so that

\[ \mathcal{T} \circ \mathcal{N} = \mathcal{N}^c. \tag{12.169} \]

Then the following statements hold:

\begin{align*}
& I(X;B)_\sigma \geq I(X;E)_\theta \tag{12.170} \\
\therefore\ & H(B)_\sigma - H(B|X)_\sigma \geq H(E)_\theta - H(E|X)_\theta \tag{12.171} \\
\therefore\ & H(B)_\sigma - H(E)_\theta \geq H(B|X)_\sigma - H(E|X)_\theta \tag{12.172} \\
\therefore\ & H\Big(\mathcal{N}\Big(\sum_x p_X(x) \rho_x\Big)\Big) - H\Big(\mathcal{N}^c\Big(\sum_x p_X(x) \rho_x\Big)\Big) \geq \sum_x p_X(x) \big[ H(\mathcal{N}(\rho_x)) - H(\mathcal{N}^c(\rho_x)) \big] \tag{12.173} \\
\therefore\ & I_c\Big(\sum_x p_X(x) \rho_x, \mathcal{N}\Big) \geq \sum_x p_X(x)\, I_c(\rho_x, \mathcal{N}). \tag{12.174}
\end{align*}

The first statement is the crucial one and follows from the quantum data processing inequality and the fact that the map $\mathcal{T}$ degrades Bob's state to Eve's state. The second and third statements follow from the definition of quantum mutual information and rearranging entropies. The fourth statement follows by plugging in the density operators into the entropies in the previous statement. The final statement follows from the alternate definition of coherent information in Exercise 12.5.1. □
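The concavity statement in (12.166) lends itself to a quick numerical sanity check. The sketch below is not part of the original development and makes several assumptions: it takes the qubit amplitude damping channel with damping parameter 0.3 as an example of a degradable channel, it evaluates the coherent information as the entropy difference $H(\mathcal{N}(\rho)) - H(\mathcal{N}^c(\rho))$ that appears in (12.173), and all function names are hypothetical.

```python
import numpy as np


def entropy(rho):
    """Von Neumann entropy in bits, using the convention 0 log 0 = 0."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log2(evals)))


def amplitude_damping_kraus(gamma):
    """Kraus operators of the qubit amplitude damping channel (assumed example)."""
    a0 = np.array([[1.0, 0.0], [0.0, np.sqrt(1.0 - gamma)]])
    a1 = np.array([[0.0, np.sqrt(gamma)], [0.0, 0.0]])
    return [a0, a1]


def coherent_information(kraus, rho):
    """I_c(rho, N) = H(N(rho)) - H(N^c(rho)) for a channel given by Kraus operators."""
    bob = sum(a @ rho @ a.conj().T for a in kraus)
    # Eve's state from the isometric extension: entries Tr[A_k rho A_l^dagger].
    eve = np.array([[np.trace(a_k @ rho @ a_l.conj().T) for a_l in kraus]
                    for a_k in kraus])
    return entropy(bob) - entropy(eve)


def random_density_matrix(dim, rng):
    """Random full-rank density operator."""
    g = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    rho = g @ g.conj().T
    return rho / np.trace(rho)


def concavity_gap(kraus, rho0, rho1, lam):
    """I_c of the mixture minus the mixture of the I_c values (non-negative if concave)."""
    mixture = lam * rho0 + (1.0 - lam) * rho1
    averaged = (lam * coherent_information(kraus, rho0)
                + (1.0 - lam) * coherent_information(kraus, rho1))
    return coherent_information(kraus, mixture) - averaged


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    kraus = amplitude_damping_kraus(0.3)
    gaps = [concavity_gap(kraus,
                          random_density_matrix(2, rng),
                          random_density_matrix(2, rng),
                          rng.uniform())
            for _ in range(1000)]
    print("smallest gap over random trials:", min(gaps))
```

A non-negative smallest gap (up to floating-point error) is consistent with the theorem for this channel. It is a spot check on random inputs, not a proof.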

12.6 Private Information of a Quantum Channel 

The private information of a quantum channel is the last information measure that we 
consider. Alice would like to establish classical correlations with Bob, but does not want the 
environment of the channel to have access to these classical correlations. The ensemble that 
she prepares is similar to the one we considered for the Holevo information. The expected 
density operator of the ensemble she prepares is a classical-quantum state of the form:

\[ \rho^{XA'} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \rho_x^{A'}. \tag{12.175} \]

Sending the $A'$ system through the isometric extension $U_{\mathcal{N}}^{A' \to BE}$ of a quantum channel $\mathcal{N}$ leads to a state $\rho^{XBE}$. A good measure of the private classical correlations that she can establish with Bob is the difference between the classical correlations she can establish with Bob and the classical correlations that Eve can obtain:

\[ I(X;B)_\rho - I(X;E)_\rho, \tag{12.176} \]

leading to our next definition (Chapter 22 discusses the operational task corresponding to this information quantity).

Definition 12.6.1 (Private Information of a Quantum Channel). The private information $P(\mathcal{N})$ of a quantum channel $\mathcal{N}$ is the maximization over all of Alice's input preparations:

\[ P(\mathcal{N}) \equiv \max_{\rho^{XA'}} \big[ I(X;B)_\rho - I(X;E)_\rho \big]. \tag{12.177} \]

Property 12.6.1 The private information $P(\mathcal{N})$ of a quantum channel $\mathcal{N}$ is non-negative:

\[ P(\mathcal{N}) \geq 0. \tag{12.178} \]

Proof. We can choose the input state $\rho^{XA'}$ to be a state of the form $|0\rangle\langle 0|^X \otimes \psi^{A'}$, where $\psi^{A'}$ is pure. The private information of this state vanishes:

\[ I(X;B)_{|0\rangle\langle 0| \otimes \mathcal{N}(\psi)} - I(X;E)_{|0\rangle\langle 0| \otimes \mathcal{N}^c(\psi)} = 0. \tag{12.179} \]

The equality follows just by evaluating both mutual informations for the above state. The above property then holds because the private information of a channel can only be greater than this amount, given that it involves a maximization over all input states and the above state is a particular input state. □

The regularized private information is as follows:

\[ P_{\text{reg}}(\mathcal{N}) = \lim_{n \to \infty} \frac{1}{n} P(\mathcal{N}^{\otimes n}). \tag{12.180} \]

12.6.1 The Relationship of Private Information with Coherent Information

The private information of a quantum channel bears a special relationship to that channel's 
coherent information. It is always at least as great as the coherent information of the channel 
and is equal to it for certain channels. The following theorem states the former inequality, 
and the next theorem states the equivalence for degradable quantum channels. 

Theorem 12.6.1. The private information $P(\mathcal{N})$ of any quantum channel $\mathcal{N}$ is at least as large as its coherent information $Q(\mathcal{N})$:

\[ Q(\mathcal{N}) \leq P(\mathcal{N}). \tag{12.181} \]

Proof. We can see this relation through a few steps. Consider a pure state $\phi^{AA'}$ that maximizes the coherent information $Q(\mathcal{N})$, and let $\phi^{ABE}$ denote the state that arises from sending the $A'$ system through the isometric extension $U_{\mathcal{N}}^{A' \to BE}$ of the channel $\mathcal{N}$. Let $\phi^{A'}$ denote the reduction of $\phi^{AA'}$ to the $A'$ system. Suppose that it admits the following spectral decomposition:

\[ \phi^{A'} = \sum_x p_X(x) |\phi_x\rangle\langle \phi_x|^{A'}. \tag{12.182} \]

We can create an augmented classical-quantum state that correlates a classical variable with the index $x$:

\[ \sigma^{XA'} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes |\phi_x\rangle\langle \phi_x|^{A'}. \tag{12.183} \]

Let $\sigma^{XBE}$ denote the state that results from sending the $A'$ system through the isometric extension $U_{\mathcal{N}}^{A' \to BE}$ of the channel $\mathcal{N}$. Then the following chain of inequalities holds:

\begin{align*}
Q(\mathcal{N}) &= I(A \rangle B)_\phi \tag{12.184} \\
&= H(B)_\phi - H(E)_\phi \tag{12.185} \\
&= H(B)_\sigma - H(E)_\sigma \tag{12.186} \\
&= H(B)_\sigma - H(B|X)_\sigma - H(E)_\sigma + H(B|X)_\sigma \tag{12.187} \\
&= I(X;B)_\sigma - H(E)_\sigma + H(E|X)_\sigma \tag{12.188} \\
&= I(X;B)_\sigma - I(X;E)_\sigma \tag{12.189} \\
&\leq P(\mathcal{N}). \tag{12.190}
\end{align*}

The first equality follows from evaluating the coherent information of the state $\phi^{ABE}$ that maximizes the coherent information of the channel. The second equality follows because the state $\phi^{ABE}$ is pure. The third equality follows from the definition of $\sigma^{XBE}$ in (12.183) and its relation to $\phi^{ABE}$. The fourth equality follows by adding and subtracting $H(B|X)_\sigma$, and the next one follows from the definition of the mutual information $I(X;B)_\sigma$ and the fact that the state of $\sigma^{XBE}$ on systems $B$ and $E$ is pure when conditioned on $X$, so that $H(B|X)_\sigma = H(E|X)_\sigma$. The last equality follows from the definition of the mutual information $I(X;E)_\sigma$. The final inequality follows because the state $\sigma^{XBE}$ is a particular state of the form in (12.175), and $P(\mathcal{N})$ involves a maximization over all states of that form. □

Theorem 12.6.2. Suppose that a quantum channel $\mathcal{N}$ is degradable. Then its private information $P(\mathcal{N})$ is equal to its coherent information $Q(\mathcal{N})$:

\[ P(\mathcal{N}) = Q(\mathcal{N}). \tag{12.191} \]

Proof. We prove the inequality $P(\mathcal{N}) \leq Q(\mathcal{N})$ for degradable quantum channels because we have already proven that $Q(\mathcal{N}) \leq P(\mathcal{N})$ for any quantum channel $\mathcal{N}$. Consider a classical-quantum state $\rho^{XBE}$ that arises from transmitting the $A'$ system of the state in (12.175) through the isometric extension $U_{\mathcal{N}}^{A' \to BE}$ of the channel. Suppose further that this state maximizes $P(\mathcal{N})$. We can take the spectral decomposition of each $\rho_x^{A'}$ in the ensemble to be as follows:

\[ \rho_x^{A'} = \sum_y p_{Y|X}(y|x)\, \psi_{x,y}^{A'}, \tag{12.192} \]

where each state $\psi_{x,y}^{A'}$ is pure. We can construct the following extension of the state $\rho^{XBE}$:

\[ \sigma^{XYBE} \equiv \sum_{x,y} p_{Y|X}(y|x)\, p_X(x) |x\rangle\langle x|^X \otimes |y\rangle\langle y|^Y \otimes U_{\mathcal{N}}^{A' \to BE}(\psi_{x,y}^{A'}). \tag{12.193} \]

Then the following chain of inequalities holds:

\begin{align*}
P(\mathcal{N}) &= I(X;B)_\rho - I(X;E)_\rho \tag{12.194} \\
&= I(X;B)_\sigma - I(X;E)_\sigma \tag{12.195} \\
&= I(XY;B)_\sigma - I(Y;B|X)_\sigma - [I(XY;E)_\sigma - I(Y;E|X)_\sigma] \tag{12.196} \\
&= I(XY;B)_\sigma - I(XY;E)_\sigma - [I(Y;B|X)_\sigma - I(Y;E|X)_\sigma]. \tag{12.197}
\end{align*}

The first equality follows from the definition of $P(\mathcal{N})$ and because we set $\rho$ to be the state that maximizes it. The second equality follows because $\rho^{XBE} = \text{Tr}_Y\{\sigma^{XYBE}\}$. The third equality follows from the chain rule for quantum mutual information. The fourth equality follows from a rearrangement of entropies. Continuing,

\begin{align*}
&\leq I(XY;B)_\sigma - I(XY;E)_\sigma \tag{12.198} \\
&= H(B)_\sigma - H(B|XY)_\sigma - H(E)_\sigma + H(E|XY)_\sigma \tag{12.199} \\
&= H(B)_\sigma - H(B|XY)_\sigma - H(E)_\sigma + H(B|XY)_\sigma \tag{12.200} \\
&= H(B)_\sigma - H(E)_\sigma \tag{12.201} \\
&\leq Q(\mathcal{N}). \tag{12.202}
\end{align*}

The first inequality (the crucial one) follows because there is a degrading channel from $B$ to $E$ and because the conditioning system $X$ is classical, allowing us to apply the quantum data processing inequality $I(Y;B|X)_\sigma \geq I(Y;E|X)_\sigma$. The second equality is a rewriting of entropies, the third follows because the state of $\sigma$ on systems $B$ and $E$ is pure when conditioned on classical systems $X$ and $Y$, and the fourth follows by canceling entropies. The last inequality follows because the entropy difference $H(B)_\sigma - H(E)_\sigma$ is less than the maximum of that difference over all possible states. □

Figure 12.6: The above figure displays the scenario for determining whether the private information of two quantum channels $\mathcal{N}_1$ and $\mathcal{N}_2$ is additive. The question of additivity is equivalent to the possibility of quantum correlations between channel inputs being able to enhance the private information of two quantum channels. The result proved in Theorem 12.6.3 is that the private information is additive for any two degradable quantum channels, so that quantum correlations cannot enhance it in this case.



12.6.2 Additivity of the Private Information of Degradable Channels

The private information of general quantum channels is not additive, but it is so in the case of degradable quantum channels. The method of proof is somewhat similar to that in the proof of Theorem 12.5.1, essentially exploiting the degradability property. Figure 12.6 illustrates the setting to consider for additivity of the private information.

Theorem 12.6.3 (Additivity of Private Information of Degradable Quantum Channels). Let $\mathcal{N}_1$ and $\mathcal{N}_2$ be any quantum channels that are degradable. Then the private information of the tensor product channel $\mathcal{N}_1 \otimes \mathcal{N}_2$ is the sum of their individual private informations:

\[ P(\mathcal{N}_1 \otimes \mathcal{N}_2) = P(\mathcal{N}_1) + P(\mathcal{N}_2). \tag{12.203} \]

Furthermore, it holds that

\[ P(\mathcal{N}_1 \otimes \mathcal{N}_2) = Q(\mathcal{N}_1 \otimes \mathcal{N}_2) = Q(\mathcal{N}_1) + Q(\mathcal{N}_2). \tag{12.204} \]

Proof. We first prove the more trivial inequality $P(\mathcal{N}_1 \otimes \mathcal{N}_2) \geq P(\mathcal{N}_1) + P(\mathcal{N}_2)$. Let $\rho^{X_1A_1'}$ and $\sigma^{X_2A_2'}$ be states of the form in (12.175) that maximize the respective private informations $P(\mathcal{N}_1)$ and $P(\mathcal{N}_2)$. Let $\theta^{X_1X_2A_1'A_2'}$ be the tensor product of these two states: $\theta = \rho \otimes \sigma$. Let $\rho^{X_1B_1E_1}$ and $\sigma^{X_2B_2E_2}$ be the states that arise from sending $\rho^{X_1A_1'}$ and $\sigma^{X_2A_2'}$ through the respective isometric extensions $U_{\mathcal{N}_1}^{A_1' \to B_1E_1}$ and $U_{\mathcal{N}_2}^{A_2' \to B_2E_2}$. Let $\theta^{X_1X_2B_1B_2E_1E_2}$ be the state that arises from sending $\theta^{X_1X_2A_1'A_2'}$ through the tensor product channel $U_{\mathcal{N}_1}^{A_1' \to B_1E_1} \otimes U_{\mathcal{N}_2}^{A_2' \to B_2E_2}$. Then

\begin{align*}
P(\mathcal{N}_1) + P(\mathcal{N}_2)
&= I(X_1;B_1)_\rho - I(X_1;E_1)_\rho + I(X_2;B_2)_\sigma - I(X_2;E_2)_\sigma \tag{12.205} \\
&= I(X_1;B_1)_\theta - I(X_1;E_1)_\theta + I(X_2;B_2)_\theta - I(X_2;E_2)_\theta \tag{12.206} \\
&= H(B_1)_\theta - H(B_1|X_1)_\theta - H(E_1)_\theta + H(E_1|X_1)_\theta \\
&\quad + H(B_2)_\theta - H(B_2|X_2)_\theta - H(E_2)_\theta + H(E_2|X_2)_\theta \tag{12.207} \\
&= H(B_1B_2)_\theta - H(B_1B_2|X_1X_2)_\theta \\
&\quad - H(E_1E_2)_\theta + H(E_1E_2|X_1X_2)_\theta \tag{12.208} \\
&= I(X_1X_2;B_1B_2)_\theta - I(X_1X_2;E_1E_2)_\theta \tag{12.209} \\
&\leq P(\mathcal{N}_1 \otimes \mathcal{N}_2). \tag{12.210}
\end{align*}

The first equality follows from the definition of the private informations $P(\mathcal{N}_1)$ and $P(\mathcal{N}_2)$ and by evaluating them on the respective states $\rho^{X_1A_1'}$ and $\sigma^{X_2A_2'}$ that maximize them. The second equality follows because the reduced state of $\theta^{X_1X_2B_1B_2E_1E_2}$ on systems $X_1$, $B_1$, and $E_1$ is equal to $\rho^{X_1B_1E_1}$, and the reduced state of $\theta^{X_1X_2B_1B_2E_1E_2}$ on systems $X_2$, $B_2$, and $E_2$ is equal to $\sigma^{X_2B_2E_2}$. The third equality follows by expanding the mutual informations. The fourth equality follows because the state $\theta^{X_1X_2B_1B_2E_1E_2}$ is product with respect to the systems $X_1B_1E_1$ and $X_2B_2E_2$. The fifth equality follows from straightforward entropic manipulations, and the final inequality follows because the state $\theta^{X_1X_2B_1B_2E_1E_2}$ is a particular state of the form needed in the maximization of the private information of the tensor product channel $\mathcal{N}_1 \otimes \mathcal{N}_2$.

We now prove the inequality $P(\mathcal{N}_1 \otimes \mathcal{N}_2) \leq P(\mathcal{N}_1) + P(\mathcal{N}_2)$. Let $\rho^{XA_1'A_2'}$ be the state that maximizes $P(\mathcal{N}_1 \otimes \mathcal{N}_2)$ where

\[ \rho^{XA_1'A_2'} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \rho_x^{A_1'A_2'}, \tag{12.211} \]

and let $\rho^{XB_1B_2E_1E_2}$ be the state that arises from sending $\rho^{XA_1'A_2'}$ through the tensor product channel $U_{\mathcal{N}_1}^{A_1' \to B_1E_1} \otimes U_{\mathcal{N}_2}^{A_2' \to B_2E_2}$. Consider a spectral decomposition of each state $\rho_x^{A_1'A_2'}$:

\[ \rho_x^{A_1'A_2'} = \sum_y p_{Y|X}(y|x)\, \psi_{x,y}^{A_1'A_2'}, \tag{12.212} \]

where each state $\psi_{x,y}^{A_1'A_2'}$ is pure. Let $\sigma^{XYA_1'A_2'}$ be an extension of $\rho^{XA_1'A_2'}$ where

\[ \sigma^{XYA_1'A_2'} \equiv \sum_{x,y} p_{Y|X}(y|x)\, p_X(x) |x\rangle\langle x|^X \otimes |y\rangle\langle y|^Y \otimes \psi_{x,y}^{A_1'A_2'}, \tag{12.213} \]

and let $\sigma^{XYB_1E_1B_2E_2}$ be the state that arises from sending $\sigma^{XYA_1'A_2'}$ through the tensor product channel $U_{\mathcal{N}_1}^{A_1' \to B_1E_1} \otimes U_{\mathcal{N}_2}^{A_2' \to B_2E_2}$. Consider the following chain of inequalities:

\begin{align*}
P(\mathcal{N}_1 \otimes \mathcal{N}_2)
&= I(X;B_1B_2)_\rho - I(X;E_1E_2)_\rho \\
&= I(X;B_1B_2)_\sigma - I(X;E_1E_2)_\sigma \\
&= I(XY;B_1B_2)_\sigma - I(XY;E_1E_2)_\sigma \\
&\quad - [I(Y;B_1B_2|X)_\sigma - I(Y;E_1E_2|X)_\sigma] \\
&\leq I(XY;B_1B_2)_\sigma - I(XY;E_1E_2)_\sigma \\
&= H(B_1B_2)_\sigma - H(B_1B_2|XY)_\sigma - H(E_1E_2)_\sigma + H(E_1E_2|XY)_\sigma \\
&= H(B_1B_2)_\sigma - H(B_1B_2|XY)_\sigma - H(E_1E_2)_\sigma + H(B_1B_2|XY)_\sigma \\
&= H(B_1B_2)_\sigma - H(E_1E_2)_\sigma \tag{12.221} \\
&= H(B_1)_\sigma - H(E_1)_\sigma + H(B_2)_\sigma - H(E_2)_\sigma \\
&\quad - [I(B_1;B_2)_\sigma - I(E_1;E_2)_\sigma] \tag{12.222} \\
&\leq H(B_1)_\sigma - H(E_1)_\sigma + H(B_2)_\sigma - H(E_2)_\sigma \tag{12.223} \\
&\leq Q(\mathcal{N}_1) + Q(\mathcal{N}_2) \tag{12.224} \\
&= P(\mathcal{N}_1) + P(\mathcal{N}_2). \tag{12.225}
\end{align*}

The first equality follows from the definition of $P(\mathcal{N}_1 \otimes \mathcal{N}_2)$ and evaluating it on the state $\rho$ that maximizes it. The second equality follows because the state $\sigma^{XYB_1E_1B_2E_2}$ is equal to the state $\rho^{XB_1E_1B_2E_2}$ after tracing out the system $Y$. The third equality follows from the chain rule for mutual information: $I(XY;B_1B_2) = I(Y;B_1B_2|X) + I(X;B_1B_2)$. It holds that $I(Y;B_1B_2|X)_\sigma \geq I(Y;E_1E_2|X)_\sigma$ because the conditioning system $X$ is classical and there is a degrading channel from $B_1$ to $E_1$ and from $B_2$ to $E_2$. Then the first inequality follows because $I(Y;B_1B_2|X)_\sigma - I(Y;E_1E_2|X)_\sigma \geq 0$. The fourth equality follows by expanding the mutual informations, and the fifth equality follows because the state $\sigma$ on systems $B_1B_2E_1E_2$ is pure when conditioning on the classical systems $X$ and $Y$. The sixth equality follows from algebra, and the seventh follows by rewriting the entropies. It holds that $I(B_1;B_2)_\sigma \geq I(E_1;E_2)_\sigma$ because there is a degrading channel from $B_1$ to $E_1$ and from $B_2$ to $E_2$. Then the second inequality follows because $I(B_1;B_2)_\sigma - I(E_1;E_2)_\sigma \geq 0$. The third inequality follows because the entropy difference $H(B_i)_\sigma - H(E_i)_\sigma$ never exceeds the coherent information of the channel $\mathcal{N}_i$, and the final equality follows because the coherent information of a channel is equal to its private information when the channel is degradable (Theorem 12.6.2). □

Corollary 12.6.1. Suppose that a quantum channel $\mathcal{N}$ is degradable. Then the regularized private information $P_{\text{reg}}(\mathcal{N})$ of the channel is equal to its private information $P(\mathcal{N})$:

\[ P_{\text{reg}}(\mathcal{N}) = P(\mathcal{N}). \tag{12.226} \]

Proof. The proof follows by the same induction argument as in Corollary 12.1.1 and by exploiting the result of Theorem 12.6.3 and the fact that the tensor power channel $\mathcal{N}^{\otimes n}$ is degradable if the original channel $\mathcal{N}$ is. □

12.7 Summary



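We conclude this chapter with a table that summarizes the main results regarding the mutual information of a classical channel $I(p_{Y|X})$, the private information of a classical wiretap channel $P(p_{Y,Z|X})$, the Holevo information of a quantum channel $\chi(\mathcal{N})$, the mutual information of a quantum channel $I(\mathcal{N})$, the coherent information of a quantum channel $Q(\mathcal{N})$, and the private information of a quantum channel $P(\mathcal{N})$. The table exploits the following definitions:

\begin{align*}
\rho^{XA'} &\equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \phi_x^{A'}, \tag{12.227} \\
\sigma^{XA'} &\equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \rho_x^{A'}. \tag{12.228}
\end{align*}

| Quantity | Input | Output | Formula | Single-letter |
|---|---|---|---|---|
| $I(p_{Y \vert X})$ | $p_X(x)$ | $p_X(x)\, p_{Y \vert X}(y \vert x)$ | $\max_{p_X(x)} I(X;Y)$ | all channels |
| $P(p_{Y,Z \vert X})$ | $p_X(x)$ | $p_X(x)\, p_{Y,Z \vert X}(y,z \vert x)$ | $\max_{p_X(x)} I(X;Y) - I(X;Z)$ | degradable |
| $\chi(\mathcal{N})$ | $\rho^{XA'}$ | $\mathcal{N}^{A' \to B}(\rho^{XA'})$ | $\max_{\rho} I(X;B)$ | many channels |
| $I(\mathcal{N})$ | $\phi^{AA'}$ | $\mathcal{N}^{A' \to B}(\phi^{AA'})$ | $\max_{\phi} I(A;B)$ | all channels |
| $Q(\mathcal{N})$ | $\phi^{AA'}$ | $\mathcal{N}^{A' \to B}(\phi^{AA'})$ | $\max_{\phi} I(A \rangle B)$ | degradable |
| $P(\mathcal{N})$ | $\sigma^{XA'}$ | $U_{\mathcal{N}}^{A' \to BE}(\sigma^{XA'})$ | $\max_{\sigma} I(X;B) - I(X;E)$ | degradable |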

12.8 History and Further Reading





The book of Boyd and Vandenberghe is useful for the theory and practice of convex optimization [45], which is helpful for computing capacity formulas. Holevo [144], Schumacher, and Westmoreland [219] provided an operational interpretation of the Holevo information of a quantum channel. Shor showed the additivity of the Holevo information for entanglement-breaking channels [226]. Adami and Cerf introduced the mutual information of a quantum channel, and they proved several of its important properties that appear in this chapter: non-negativity, additivity, and concavity [5]. Bennett et al. later gave an operational interpretation for this information quantity as the entanglement-assisted classical capacity of a quantum channel [33, 34]. Lloyd [185], Shor [227], and Devetak [68] gave increasingly rigorous proofs that the coherent information of a quantum channel is an achievable rate for quantum communication. Devetak and Shor showed that the coherent information of a quantum channel is additive for degradable channels [73]. Yard et al. proved that the coherent information of a quantum channel is a concave function of the input state whenever the channel is degradable [269]. Garcia-Patron et al. and Devetak et al. both discussed the reverse coherent information of a quantum channel and showed that it is additive for all quantum channels [100, 72]. Devetak and Cai et al. [51] independently introduced the private classical capacity of a quantum channel, and both papers proved that it is an achievable rate for private classical communication over a quantum channel. Smith showed that the private classical information is additive and equal to the coherent information for degradable quantum channels [230].

©2012 Mark M. Wilde. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

CHAPTER 13

Classical Typicality



This chapter begins our first technical foray into the asymptotic theory of information. We 
start with the classical setting in an effort to build up our intuition of asymptotic behavior 
before delving into the asymptotic theory of quantum information. 

The central concept of this chapter is the asymptotic equipartition property. The name 
of this property may sound somewhat technical at first, but it is merely an application of 
the law of large numbers to a sequence drawn independently and identically from a distribu- 
tion px(x) for some random variable X. The asymptotic equipartition property reveals that 
we can divide sequences into two classes when their length becomes large: those that are 
overwhelmingly likely to occur and those that are overwhelmingly likely not to occur. The 
sequences that are likely to occur are the typical sequences, and the ones that are not likely 
to occur are the atypical sequences. Additionally, the size of the set of typical sequences 
is exponentially smaller than the size of the set of all sequences whenever the random vari- 
able generating the sequences is not uniform. These properties are an example of a more 
general mathematical phenomenon known as "measure concentration," in which a smooth 
function over a high-dimensional space or over a large number of random variables tends to 
concentrate around a constant value with high probability. 

The asymptotic equipartition property immediately leads to the intuition behind Shan- 
non's scheme for compressing classical information. The scheme first generates a realization 
of a random sequence and asks the question: Is the produced sequence typical or atypical? If 
it is typical, compress it. Otherwise, throw it away. The error probability of this compression 
scheme is non-zero for any fixed length of a sequence, but it vanishes in the asymptotic limit 
because the probability of the sequence being in the typical set converges to one, while the 
probability that it is in the atypical set converges to zero. This compression scheme has a 
straightforward generalization to the quantum setting, where we wish to compress qubits 
instead of classical bits. 

The bulk of this chapter presents the many technical details needed to make rigorous statements in the asymptotic theory of information. We begin with an example, follow with the formal definition of a typical sequence and a typical set, and prove the three important properties of a typical set. We then discuss other forms of typicality such as joint typicality and conditional typicality. These other notions turn out to be useful for proving Shannon's classical capacity theorem as well (recall that Shannon's theorem gives the ultimate rate at which a sender can transmit classical information over a classical channel to a receiver). We also introduce the method of types, which is a powerful technique in classical information theory, and apply this method in order to develop a stronger notion of typicality. The chapter then features a development of the strong notions of joint and conditional typicality and ends with a concise proof of Shannon's important channel capacity theorem.

Figure 13.1: The above figure depicts the sample entropy of a realization of a random binary sequence as a function of its length (plot title: "The Law of Large Numbers and the Sample Entropy"; horizontal axis: length of sequence). The source is a binary random variable with distribution $(\frac{3}{4}, \frac{1}{4})$. For the realizations generated, the sample entropy of the sequences is converging to the true entropy of the source.



13.1 An Example of Typicality 



Suppose that Alice possesses a binary random variable $X$ that takes the value zero with probability $\frac{3}{4}$ and the value one with probability $\frac{1}{4}$. Such a random source might produce the following sequence:

\[ 0110001101, \tag{13.1} \]

if we generate ten realizations of it. The probability that such a sequence occurs is

\[ \left(\frac{1}{4}\right)^5 \left(\frac{3}{4}\right)^5, \tag{13.2} \]

determined simply by counting the number of ones and zeros in the above sequence and by applying the independent and identically distributed (IID) property of the source.

The information content of the above sequence is the negative logarithm of its probability divided by its length:

\[ -\frac{5}{10} \log\left(\frac{3}{4}\right) - \frac{5}{10} \log\left(\frac{1}{4}\right) \approx 1.207. \tag{13.3} \]

We also refer to this quantity as the sample entropy. The true entropy of the source is

\[ -\frac{1}{4} \log\left(\frac{1}{4}\right) - \frac{3}{4} \log\left(\frac{3}{4}\right) \approx 0.8113. \tag{13.4} \]

We would expect that the sample entropy of a random sequence tends to approach the true entropy as its size increases because the number of zeros should be approximately $n(3/4)$ and the number of ones should be approximately $n(1/4)$ according to the law of large numbers. Another sequence of length 100 might be as follows:

00000000100010001000000000000110011010000000100000 
00000110101001000000010000001000000010000100010000, (13.5) 

featuring 81 zeros and 19 ones. Its sample entropy is

\[ -\frac{81}{100} \log\left(\frac{3}{4}\right) - \frac{19}{100} \log\left(\frac{1}{4}\right) \approx 0.7162. \tag{13.6} \]

The above sample entropy is closer to the true entropy in (13.4) than the sample entropy of the previous sequence, but it still deviates significantly from it.
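The arithmetic in (13.3), (13.4), and (13.6) is easy to reproduce. The short sketch below recomputes the sample entropy of the two sequences above and the true entropy of the source, using base-two logarithms. The function and variable names are illustrative choices and do not come from the text.

```python
import math

P_ZERO = 0.75  # probability of the symbol 0 for the source in this example


def sample_entropy(bits, p_zero=P_ZERO):
    """Return -(1/n) log2 p(x^n) for an IID binary source with Pr{0} = p_zero."""
    n = len(bits)
    zeros = bits.count("0")
    ones = n - zeros
    log_prob = zeros * math.log2(p_zero) + ones * math.log2(1.0 - p_zero)
    return -log_prob / n


def source_entropy(p_zero=P_ZERO):
    """True entropy H(X) of the binary source, in bits."""
    return -p_zero * math.log2(p_zero) - (1.0 - p_zero) * math.log2(1.0 - p_zero)


SHORT_SEQUENCE = "0110001101"  # the sequence in (13.1)
LONG_SEQUENCE = (              # the sequence in (13.5)
    "00000000100010001000000000000110011010000000100000"
    "00000110101001000000010000001000000010000100010000"
)

if __name__ == "__main__":
    print("sample entropy, length 10: ", sample_entropy(SHORT_SEQUENCE))  # compare with (13.3)
    print("sample entropy, length 100:", sample_entropy(LONG_SEQUENCE))   # compare with (13.6)
    print("true entropy:              ", source_entropy())                # compare with (13.4)
```

Because the source is IID, the sample entropy depends on the sequence only through the number of zeros and ones, which is why counting symbols suffices.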



Figure 13.1 continues this game by generating random sequences according to the distribution $(\frac{3}{4}, \frac{1}{4})$, and the result is that a concentration around the true entropy begins to occur around $n \approx 10^6$. That is, it becomes highly likely that the sample entropy of a random sequence is close to the true entropy if we increase the length of the sequence, and this holds for the realizations generated in Figure 13.1.
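An experiment in the spirit of Figure 13.1 can be sketched in a few lines. The code below is not the procedure used to produce the figure: it draws pseudo-random sequences of increasing length from the same $(\frac{3}{4}, \frac{1}{4})$ source and prints their sample entropies next to the true entropy. The seed, the sequence lengths, and all names are arbitrary assumptions.

```python
import math
import random

P_ZERO = 0.75
TRUE_ENTROPY = -P_ZERO * math.log2(P_ZERO) - (1 - P_ZERO) * math.log2(1 - P_ZERO)


def simulate_sample_entropy(n, p_zero=P_ZERO, seed=0):
    """Draw n IID symbols and return the sample entropy of the realization."""
    rng = random.Random(seed)
    zeros = sum(1 for _ in range(n) if rng.random() < p_zero)
    ones = n - zeros
    log_prob = zeros * math.log2(p_zero) + ones * math.log2(1 - p_zero)
    return -log_prob / n


if __name__ == "__main__":
    print("true entropy:", round(TRUE_ENTROPY, 4))
    for exponent in range(1, 6):
        n = 10 ** exponent
        print(n, round(simulate_sample_entropy(n), 4))
```

The fluctuation of the sample entropy around the true entropy shrinks roughly like $1/\sqrt{n}$, which is the law of large numbers behavior that the figure illustrates.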

13.2 Weak Typicality 



This first section generalizes the example from the introduction to an arbitrary discrete, finite-valued random variable. Our first notion of typicality is the same as that discussed in the example: we define a sequence to be typical if its sample entropy is close to the true entropy of the random variable that generates it. This notion of typicality is known as weak typicality. Section 13.7 introduces another notion of typicality that implies weak typicality, but the implication does not hold in the other direction. For this reason, we distinguish the two different notions of typicality as weak typicality and strong typicality.

Suppose that a random variable $X$ takes values in an alphabet $\mathcal{X}$ with cardinality $|\mathcal{X}|$. Let us label the symbols in the alphabet as $a_1, a_2, \ldots, a_{|\mathcal{X}|}$. An independent and identically distributed (IID) information source samples independently from the distribution of random variable $X$ and emits $n$ realizations $x_1, \ldots, x_n$. Let $X^n \equiv X_1 \cdots X_n$ denote the $n$ random variables that describe the information source, and let $x^n \equiv x_1 \cdots x_n$ denote an emitted realization of $X^n$. The probability $p_{X^n}(x^n)$ of a particular string $x^n$ is as follows:

\[ p_{X^n}(x^n) \equiv p_{X_1, \ldots, X_n}(x_1, \ldots, x_n), \tag{13.7} \]

and $p_{X^n}(x^n)$ factors as follows because the source is IID:

\[ p_{X^n}(x^n) = p_{X_1}(x_1) \cdots p_{X_n}(x_n) = p_X(x_1) \cdots p_X(x_n) = \prod_{i=1}^{n} p_X(x_i). \tag{13.8} \]

Roughly speaking, we expect a long string $x^n$ to contain about $n p_X(a_1)$ occurrences of symbol $a_1$, $n p_X(a_2)$ occurrences of symbol $a_2$, etc., when $n$ is large. The probability that the source emits a particular string $x^n$ is approximately

\[ p_{X^n}(x^n) = p_X(x_1) \cdots p_X(x_n) \approx p_X(a_1)^{n p_X(a_1)} \cdots p_X(a_{|\mathcal{X}|})^{n p_X(a_{|\mathcal{X}|})}, \tag{13.9} \]

and the information content of a given string is thus roughly

\[ -\frac{1}{n} \log(p_{X^n}(x^n)) \approx -\sum_{i=1}^{|\mathcal{X}|} p_X(a_i) \log(p_X(a_i)) = H(X). \tag{13.10} \]

The above intuitive argument shows that the information content divided by the length of 
the sequence is roughly equal to the entropy in the limit of large n. It then makes sense to 
think of this quantity as the sample entropy of the sequence x n . 

Definition 13.2.1 (Sample Entropy). The sample entropy H(x n ) of a sequence x n is as 
follows: 

H{x n ) = --\og{p x ^x n )). (13.11) 

n 

This definition of sample entropy leads us to our first important definitions in asymptotic 
information theory. 

Definition 13.2.2 (Typical Sequence). A sequence x n is 5-typical if its sample entropy 
H(x n ) is 5-close to the entropy H(X) of random variable X, where this random variable is 
the source of the sequence. 

Definition 13.2.3 (Typical Set). The $\delta$-typical set $T_\delta^{X^n}$ is the set of all $\delta$-typical sequences $x^n$:

$$T_\delta^{X^n} = \{x^n : |\overline{H}(x^n) - H(X)| \leq \delta\}. \tag{13.12}$$

[Figure 13.2 image; labels: Typical Set ($nH(X)$), Set of All Sequences $x^n$ ($n\log|\mathcal{X}|$)]

Figure 13.2: The above figure depicts the idea that the typical set is exponentially smaller than the set
of all sequences because $|\mathcal{X}|^n = 2^{n\log|\mathcal{X}|} > 2^{nH(X)}$ whenever $X$ is not a uniform random variable. Yet, this
exponentially small set contains nearly all of the probability.
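The definitions of sample entropy (13.11) and of the typical set (13.12) can be made concrete with a few lines of Python. The following is only a minimal sketch: the binary source with $p(0) = 1/4$, $p(1) = 3/4$ and the function names are illustrative assumptions, and logarithms are taken base 2.

```python
import math

def entropy(p):
    """Shannon entropy H(X) in bits of a distribution {symbol: probability}."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def sample_entropy(seq, p):
    """Sample entropy -(1/n) log2 p_{X^n}(x^n) of an IID sequence, as in (13.11)."""
    return -sum(math.log2(p[s]) for s in seq) / len(seq)

def is_weakly_typical(seq, p, delta):
    """True if the sample entropy is delta-close to H(X), as in (13.12)."""
    return abs(sample_entropy(seq, p) - entropy(p)) <= delta

# Hypothetical binary source (an assumption made for illustration only).
p = {"0": 0.25, "1": 0.75}

print(entropy(p))                              # about 0.811 bits
print(sample_entropy("0111", p))               # one 0 in four symbols: matches H(X)
print(is_weakly_typical("0111", p, delta=0.1))
print(is_weakly_typical("1111", p, delta=0.1))
```

The string "0111" has exactly the relative frequencies of the source, so its sample entropy coincides with $H(X)$, whereas the all-ones string is the single most likely string and yet is not typical for $\delta = 0.1$.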



13.3 Properties of the Typical Set 



The set of typical sequences enjoys three useful and beautifully surprising properties that 
occur when we step into the "land of large numbers." We can summarize these properties 
as follows: the typical set contains almost all the probability, yet it is exponentially smaller 
than the set of all sequences, and each typical sequence has almost uniform probability. 



Figure 13.2 attempts to depict the main idea of the typical set. 



Property 13.3.1 (Unit Probability) The typical set asymptotically has probability one.
So as $n$ becomes large, it is highly likely that a source emits a typical sequence. We formally
state this property as follows:

$$\forall \epsilon > 0 \quad \Pr\{X^n \in T_\delta^{X^n}\} = \sum_{x^n \in T_\delta^{X^n}} p_{X^n}(x^n) \geq 1 - \epsilon \quad \text{for sufficiently large } n. \tag{13.13}$$



Property 13.3.2 (Exponentially Small Cardinality) The number $|T_\delta^{X^n}|$ of $\delta$-typical
sequences is exponentially smaller than the total number $|\mathcal{X}|^n$ of sequences for every random
variable $X$ besides the uniform random variable. We formally state this property as follows:

$$|T_\delta^{X^n}| \leq 2^{n(H(X)+\delta)}. \tag{13.14}$$

We can also lower bound the size of the $\delta$-typical set when $n$ is sufficiently large:

$$\forall \epsilon > 0 \quad |T_\delta^{X^n}| \geq (1-\epsilon)\,2^{n(H(X)-\delta)} \quad \text{for sufficiently large } n. \tag{13.15}$$



Property 13.3.3 (Equipartition) The probability of a particular $\delta$-typical sequence $x^n$ is
approximately uniform:

$$2^{-n(H(X)+\delta)} \leq p_{X^n}(x^n) \leq 2^{-n(H(X)-\delta)}. \tag{13.16}$$

This last property is the "equipartition" in "asymptotic equipartition property" because all
typical sequences occur with nearly the same probability when $n$ is large.

The size $|T_\delta^{X^n}|$ of the $\delta$-typical set is approximately equal to the total number $|\mathcal{X}|^n$ of
sequences only when random variable $X$ is uniform because $H(X) = \log|\mathcal{X}|$ and thus

$$|T_\delta^{X^n}| \leq 2^{n(H(X)+\delta)} = 2^{n(\log|\mathcal{X}|+\delta)} = |\mathcal{X}|^n \cdot 2^{n\delta} \approx |\mathcal{X}|^n. \tag{13.17}$$

13.3.1 Proofs of Typical Set Properties 



Proof of the Unit Probability Property (Property 13.3.1). The weak law of large numbers states
that the sample mean converges in probability to the expectation. More precisely, consider
a sequence of IID random variables $X_1, \ldots, X_n$ that each have expectation $\mu$. The sample
average of this sequence is as follows:

$$\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i. \tag{13.18}$$

The formal statement of the law of large numbers is

$$\forall \epsilon, \delta > 0 \quad \exists n_0 : \forall n > n_0 \quad \Pr\{|\overline{X} - \mu| < \delta\} > 1 - \epsilon. \tag{13.19}$$

We can now consider the sequence of random variables $-\log(p_X(X_1)), \ldots, -\log(p_X(X_n))$.
The sample average of this sequence is equal to the sample entropy of $X^n$:

$$-\frac{1}{n}\sum_{i=1}^{n}\log(p_X(X_i)) = -\frac{1}{n}\log(p_{X^n}(X^n)) \tag{13.20}$$

$$= \overline{H}(X^n). \tag{13.21}$$

Recall from (10.3) that the expectation of the random variable $-\log(p_X(X))$ is equal to the
Shannon entropy:

$$\mathbb{E}_X\{-\log(p_X(X))\} = H(X). \tag{13.22}$$

Then we can apply the law of large numbers and find that

$$\forall \epsilon, \delta > 0 \quad \exists n_0 : \forall n > n_0 \quad \Pr\{|\overline{H}(X^n) - H(X)| < \delta\} > 1 - \epsilon. \tag{13.23}$$

The event $\{|\overline{H}(X^n) - H(X)| \leq \delta\}$ is precisely the condition for a random sequence $X^n$ to be
in the typical set $T_\delta^{X^n}$, and the probability of this event goes to one as $n$ becomes large. □



Proof of the Exponentially Small Cardinality Property (Property 13.3.2). Consider the following
chain of inequalities:

$$1 = \sum_{x^n \in \mathcal{X}^n} p_{X^n}(x^n) \geq \sum_{x^n \in T_\delta^{X^n}} p_{X^n}(x^n) \geq \sum_{x^n \in T_\delta^{X^n}} 2^{-n(H(X)+\delta)} = 2^{-n(H(X)+\delta)}\,|T_\delta^{X^n}|. \tag{13.24}$$

The first inequality uses the fact that the probability of the typical set is smaller than
the probability of the set of all sequences. The second inequality uses the equipartition
property of typical sets (proved below). After rearranging the leftmost side of (13.24) with
its rightmost side, we find that

$$|T_\delta^{X^n}| \leq 2^{n(H(X)+\delta)}. \tag{13.25}$$

The second part of the property follows because the "unit probability" property holds for
sufficiently large $n$. Then the following chain of inequalities holds:

$$1 - \epsilon \leq \Pr\{X^n \in T_\delta^{X^n}\} = \sum_{x^n \in T_\delta^{X^n}} p_{X^n}(x^n) \leq \sum_{x^n \in T_\delta^{X^n}} 2^{-n(H(X)-\delta)} = 2^{-n(H(X)-\delta)}\,|T_\delta^{X^n}|. \tag{13.26}$$

We can then bound the size of the typical set as follows:

$$|T_\delta^{X^n}| \geq 2^{n(H(X)-\delta)}(1-\epsilon), \tag{13.27}$$

for any $\epsilon > 0$ and sufficiently large $n$. □



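Proof of the Equipartition Property (Property 13.3.3). The property follows immediately by
manipulating the definition of a typical set. Indeed, the condition $|\overline{H}(x^n) - H(X)| \leq \delta$ is
equivalent to $H(X) - \delta \leq -\frac{1}{n}\log(p_{X^n}(x^n)) \leq H(X) + \delta$, and multiplying by $-n$ and
exponentiating gives (13.16). □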
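The three properties can be checked by brute force for a short block length. The sketch below is illustrative only: the binary source, the block length $n = 12$, and $\delta = 0.2$ are assumptions chosen so that the full enumeration of $2^{12}$ sequences is cheap. At such a small $n$ the typical set holds only a moderate share of the probability, which is a reminder that Property 13.3.1 is an asymptotic statement.

```python
import itertools
import math

# Hypothetical source and parameters, chosen only for illustration.
p = {0: 0.25, 1: 0.75}
n, delta = 12, 0.2
H = -sum(q * math.log2(q) for q in p.values())

# Enumerate all sequences and keep those whose sample entropy is delta-close to H(X).
typical = {}
for seq in itertools.product(p, repeat=n):
    prob = math.prod(p[s] for s in seq)
    if abs(-math.log2(prob) / n - H) <= delta:
        typical[seq] = prob

prob_typical = sum(typical.values())
size_bound = 2 ** (n * (H + delta))            # right-hand side of (13.14)
lower_prob = 2 ** (-n * (H + delta))           # left-hand side of (13.16)
upper_prob = 2 ** (-n * (H - delta))           # right-hand side of (13.16)

print(len(typical), "typical sequences out of", 2 ** n)
print("cardinality bound:", size_bound)
print("probability of the typical set:", prob_typical)
print("equipartition holds:",
      all(lower_prob <= q <= upper_prob for q in typical.values()))
```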

13.4 Application of Typical Sequences: Shannon Compression

The above three properties of typical sequences immediately give our first application in 
asymptotic information theory. It is Shannon's compression protocol, which is a scheme for 
compressing the output of an information source. 

We begin by defining the information processing task and a corresponding $(n, R, \epsilon)$ source
code. It is helpful to recall the picture in Figure 2.1. An information source outputs a
sequence $x^n$ drawn independently according to the distribution of some random variable $X$.
A sender Alice encodes this sequence according to some encoding map $E$ where

$$E : \mathcal{X}^n \to \{0,1\}^{nR}. \tag{13.28}$$

The encoding takes elements from the set $\mathcal{X}^n$ of all sequences to a set $\{0,1\}^{nR}$ of size $2^{nR}$.
She then transmits the codewords over $nR$ uses of a noiseless classical channel. Bob decodes
according to some decoding map $D : \{0,1\}^{nR} \to \mathcal{X}^n$. The probability of error for an $(n, R, \epsilon)$
source code is

$$p(e) = \Pr\{(D \circ E)(X^n) \neq X^n\} \leq \epsilon. \tag{13.29}$$

The rate of the source code is the number of channel uses divided by the length of the
sequence, and it is equal to $R$ for the above scheme. A particular compression rate $R$ is
achievable if there exists an $(n, R + \delta, \epsilon)$ source code for all $\epsilon, \delta > 0$ and all sufficiently large
$n$. We can now state Shannon's lossless compression theorem.

Figure 13.3: Shannon's scheme for the compression of classical data. The encoder $f$ is a map from the
typical set to a set of binary sequences of size $\approx 2^{nH(X)}$ where $H(X)$ is the entropy of the information source.
The map $f$ is invertible on the typical set but maps an atypical sequence to a constant. Alice then transmits
the compressed data over $\approx nH(X)$ uses of a noiseless classical channel. The inverse map $f^{-1}$ (the decoder)
is the inverse of $f$ on the typical set and decodes to some error sequence otherwise.



Theorem 13.4.1 (Shannon Compression). The entropy of the source is the smallest achievable
rate for compression:

$$\inf\{R : R \text{ is achievable}\} = H(X). \tag{13.30}$$



The proof of this theorem consists of two parts, traditionally called the direct coding
theorem and the converse theorem. The direct coding theorem is the direction LHS $\leq$
RHS: the proof exhibits a coding scheme with an achievable rate and demonstrates that its
rate converges to the entropy in the asymptotic limit. The converse theorem is the direction
LHS $\geq$ RHS and is a statement of optimality: it proves that any rate
below the entropy is not achievable. The proofs of each part are usually completely different.
We employ typical sequences and their properties for proving a direct coding theorem, while
the converse part resorts to information inequalities from Chapter 10.¹ For now, we prove
the direct coding theorem and hold off on the converse part until we reach Schumacher
compression for quantum information in Chapter 17. Our main goal here is to illustrate
a simple application of typical sequences, and we can wait on the converse part because
Shannon compression is a special case of Schumacher compression.

The idea behind the proof of the direct coding theorem is simple: just compress the
typical sequences and throw away the rest. A code of this form succeeds with asymptotically
vanishing probability of error because the typical set asymptotically has all of the probability.
Since we are only concerned with error probabilities in communication protocols, it makes
sense that we should only be keeping track of a set where all of the probability concentrates.
We can formally state the proof as follows. Pick an $\epsilon > 0$, a $\delta > 0$, and a sufficiently large $n$
such that Property 13.3.1 holds. Consider that Property 13.3.2 then holds so that the size of
the typical set is no larger than $2^{n(H(X)+\delta)}$. We choose the encoding to be a function $f$ that
maps a typical sequence to a binary sequence in $\{0,1\}^{nR} \setminus e_0$, where $R = H(X) + \delta$ and $e_0$ is
some error symbol in $\{0,1\}^{nR}$. We define $f$ to map any atypical sequence to $e_0$. This scheme
gives up on encoding the atypical sequences because they have vanishingly small probability.
We define the decoding operation to be the inverse of $f$ on the typical set, while mapping
to some fixed sequence $x^n \in \mathcal{X}^n$ if the received symbol is $e_0$. This scheme has probability of
error less than $\epsilon$, by considering Property 13.3.1. Figure 13.3 depicts this coding scheme.

¹ The direct part of a quantum coding theorem can employ the properties of typical subspaces (discussed
in Chapter 14), and the proof of a converse theorem for quantum information usually employs the quantum
information inequalities from Chapter 11.
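The coding scheme in the proof can be sketched in Python for a small block length. This is an illustration under stated assumptions, not a practical compressor: the function name `build_code`, the choice of the all-zeros word as the error symbol $e_0$, the rounding of $nR$ up to an integer number of bits, and the choice of the fallback sequence are all hypothetical, and the table-based encoder takes time and memory exponential in $n$.

```python
import itertools
import math

def build_code(p, n, delta):
    """Return (encode, decode, width) for a typical-set source code.

    Typical sequences are numbered 1, 2, ... and sent as `width`-bit words;
    the all-zeros word plays the role of the error symbol e_0 (an assumption).
    """
    H = -sum(q * math.log2(q) for q in p.values() if q > 0)
    width = math.ceil(n * (H + delta))           # nR bits with R = H(X) + delta
    typical = [seq for seq in itertools.product(sorted(p), repeat=n)
               if abs(-sum(math.log2(p[s]) for s in seq) / n - H) <= delta]
    # The sketch needs one spare word for e_0; this holds in the example below.
    assert len(typical) <= 2 ** width - 1
    error_word = "0" * width
    encode_table = {seq: format(i + 1, f"0{width}b") for i, seq in enumerate(typical)}
    decode_table = {word: seq for seq, word in encode_table.items()}
    fallback = typical[0]                        # fixed sequence returned on e_0

    def encode(seq):
        return encode_table.get(tuple(seq), error_word)

    def decode(word):
        return decode_table.get(word, fallback)

    return encode, decode, width

# Hypothetical source: p(0) = 1/4, p(1) = 3/4, block length 12, delta = 0.1.
encode, decode, width = build_code({0: 0.25, 1: 0.75}, n=12, delta=0.1)
x = (0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1)         # typical: three zeros in twelve symbols
print(width, encode(x), decode(encode(x)) == x)
print(encode((1,) * 12))                         # atypical input maps to the error word
```

In this example 12 source symbols are sent with 11 bits, and only the typical sequences survive the round trip; the error probability is the probability of the atypical set, which is small only for much larger $n$.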



Shannon's scheme for compression suffers from a problem that plagues all results in 
classical and quantum information theory. The proof guarantees that there exists a scheme 
that can compress at the rate of entropy in the asymptotic limit. But the complexity 
of encoding and decoding is far from practical — without any further specification of the 
encoding, it could require resources that are prohibitively exponential in the size of the 
sequence. 

The above scheme certainly gives an achievable rate for compression of classical informa- 
tion, but how can we know that it is optimal? The converse theorem addresses this point 
(recall that a converse theorem gives a sense of optimality for a particular protocol) and 
completes the operational interpretation of the entropy as the fundamental limit on the 
compressibility of classical information. For now, we do not prove a converse theorem and 
instead choose to wait until we cover Schumacher compression because its converse proof 
applies to Shannon compression as well. 



13.5 Weak Joint Typicality 



Joint typicality is a concept similar to typicality, but the difference is that it applies to any
two random variables $X$ and $Y$. That is, there are analogous notions of typicality for the
joint random variable $(X, Y)$.

Definition 13.5.1 (Joint Sample Entropy). Consider $n$ independent realizations $x^n = x_1 \cdots x_n$
and $y^n = y_1 \cdots y_n$ of respective random variables $X$ and $Y$. The sample joint
entropy $\overline{H}(x^n, y^n)$ of these two sequences is

$$\overline{H}(x^n, y^n) = -\frac{1}{n}\log(p_{X^n,Y^n}(x^n, y^n)), \tag{13.31}$$

where we assume that the joint distribution $p_{X^n,Y^n}(x^n, y^n)$ has the IID property:

$$p_{X^n,Y^n}(x^n, y^n) = p_{X,Y}(x_1, y_1) \cdots p_{X,Y}(x_n, y_n). \tag{13.32}$$

This notion of joint sample entropy immediately leads to the following definition of joint
typicality. Figure 13.4 attempts to depict the notion of joint typicality.



Definition 13.5.2 (Jointly Typical Sequence). Two sequences $x^n, y^n$ are $\delta$-jointly-typical
if their sample joint entropy $\overline{H}(x^n, y^n)$ is $\delta$-close to the joint entropy $H(X, Y)$ of random
variables $X$ and $Y$ and if both $x^n$ and $y^n$ are marginally typical.

Figure 13.4: A depiction of the jointly typical set. Some sequence pairs $(x^n, y^n)$ are such that $x^n$ is typical
or such that $y^n$ is typical, but fewer are such that the pair is jointly typical. The jointly typical set has
size roughly equal to $2^{nH(X,Y)}$, which is smaller than the Cartesian product of the marginally typical sets if
random variables $X$ and $Y$ are not independent.



Definition 13.5.3 (Jointly Typical Set). The $\delta$-jointly typical set $T_\delta^{X^nY^n}$ consists of all
$\delta$-jointly typical sequences:

$$T_\delta^{X^nY^n} = \{(x^n, y^n) : |\overline{H}(x^n, y^n) - H(X, Y)| \leq \delta,\ x^n \in T_\delta^{X^n},\ y^n \in T_\delta^{Y^n}\}. \tag{13.33}$$

The extra conditions on the marginal sample entropies are necessary to have a sensible
definition of joint typicality. That is, it does not necessarily follow that the marginal sample
entropies are close to the marginal true entropies if the joint ones are close, but it intuitively
makes sense that this condition should hold. Thus, we add these extra conditions to the
definition of jointly typical sequences. Later, we find in Section 13.7 that the intuitive
implication holds (it is not necessary to include the marginals) when we employ a stronger
definition of typicality.



13.5.1 Properties of the Jointly Typical Set 

The set $T_\delta^{X^nY^n}$ of jointly typical sequences enjoys three properties similar to what we have
seen in Section 13.2, and the proofs of these properties are identical to those in Section 13.2.

Property 13.5.1 (Unit Probability) The jointly typical set asymptotically has probability
one. So as $n$ becomes large, it is highly likely that a source emits a jointly typical
sequence. We formally state this property as follows:

$$\forall \epsilon > 0 \quad \Pr\{X^nY^n \in T_\delta^{X^nY^n}\} \geq 1 - \epsilon \quad \text{for sufficiently large } n. \tag{13.34}$$

Property 13.5.2 (Exponentially Small Cardinality) The number $|T_\delta^{X^nY^n}|$ of $\delta$-jointly
typical sequences is exponentially smaller than the total number $(|\mathcal{X}||\mathcal{Y}|)^n$ of sequences for
any joint random variable $(X, Y)$ that is not uniform. We formally state this property as
follows:

$$|T_\delta^{X^nY^n}| \leq 2^{n(H(X,Y)+\delta)}. \tag{13.35}$$

We can also lower bound the size of the $\delta$-jointly typical set when $n$ is sufficiently large:

$$\forall \epsilon > 0 \quad |T_\delta^{X^nY^n}| \geq (1-\epsilon)\,2^{n(H(X,Y)-\delta)} \quad \text{for sufficiently large } n. \tag{13.36}$$

Property 13.5.3 (Equipartition) The probability of a particular $\delta$-jointly typical sequence
$x^ny^n$ is approximately uniform:

$$2^{-n(H(X,Y)+\delta)} \leq p_{X^n,Y^n}(x^n, y^n) \leq 2^{-n(H(X,Y)-\delta)}. \tag{13.37}$$

Exercise 13.5.1 Prove the above three properties of the jointly typical set.

The above three properties may be similar to what we have seen before, but there is
another interesting property of jointly typical sequences that we give below. It states that
two sequences drawn independently according to the marginal distributions $p_X(x)$ and $p_Y(y)$
are jointly typical according to the joint distribution $p_{X,Y}(x, y)$ with probability $\approx 2^{-nI(X;Y)}$.
This property gives a simple interpretation of the mutual information that is related to its
most important operational interpretation as the classical channel capacity discussed briefly
in Section 2.2.

Property 13.5.4 (Probability of Joint Typicality) Consider two independent random
variables $\tilde{X}^n$ and $\tilde{Y}^n$ whose respective probability density functions $p_{\tilde{X}^n}(x^n)$ and $p_{\tilde{Y}^n}(y^n)$ are
equal to the marginal densities of the joint density $p_{X^n,Y^n}(x^n, y^n)$:

$$(\tilde{X}^n, \tilde{Y}^n) \sim p_{X^n}(x^n)\,p_{Y^n}(y^n). \tag{13.38}$$

Then we can bound the probability that two random sequences $\tilde{X}^n$ and $\tilde{Y}^n$ are in the jointly
typical set $T_\delta^{X^nY^n}$:

$$\Pr\{(\tilde{X}^n, \tilde{Y}^n) \in T_\delta^{X^nY^n}\} \leq 2^{-n(I(X;Y)-3\delta)}. \tag{13.39}$$

Exercise 13.5.2 Prove Property 13.5.4. (Hint: Consider that

$$\Pr\{(\tilde{X}^n, \tilde{Y}^n) \in T_\delta^{X^nY^n}\} = \sum_{(x^n,y^n) \in T_\delta^{X^nY^n}} p_{X^n}(x^n)\,p_{Y^n}(y^n), \tag{13.40}$$

and use the properties of typical and jointly typical sets to bound this probability.)

Figure 13.5: The notion of the conditionally typical set. A typical sequence $x^n$ in $T_\delta^{X^n}$ maps stochastically
through many instantiations of a conditional distribution $p_{Y|X}(y|x)$ to some sequence $y^n$. It is overwhelmingly
likely that $y^n$ is in a conditionally typical set $T_\delta^{Y^n|x^n}$ when $n$ becomes large. This conditionally typical
set has size around $2^{nH(Y|X)}$. It contains nearly all of the probability but is exponentially smaller than the
set of all sequences $y^n$.
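The bound (13.39) of Property 13.5.4 can be checked numerically for a small example. The sketch below is only an illustration: the joint distribution (a uniform bit $X$ observed through a channel that flips it with probability $q = 0.1$), the block length $n = 8$, and $\delta = 0.1$ are assumptions chosen so that all $2^{16}$ pairs of sequences can be enumerated.

```python
import itertools
import math

# Hypothetical joint distribution: uniform X, and Y equal to X flipped with probability q.
q = 0.1
p_xy = {(x, y): (1 - q) / 2 if x == y else q / 2 for x in (0, 1) for y in (0, 1)}
p_x = {x: sum(p_xy[x, y] for y in (0, 1)) for x in (0, 1)}
p_y = {y: sum(p_xy[x, y] for x in (0, 1)) for y in (0, 1)}

def H(dist):
    """Shannon entropy in bits."""
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)

def sample_H(symbols, dist):
    """Sample entropy of an IID sequence of symbols drawn from dist."""
    return -sum(math.log2(dist[s]) for s in symbols) / len(symbols)

n, delta = 8, 0.1
H_x, H_y, H_xy = H(p_x), H(p_y), H(p_xy)
mutual_info = H_x + H_y - H_xy

sequences = list(itertools.product((0, 1), repeat=n))
typical_x = [xs for xs in sequences if abs(sample_H(xs, p_x) - H_x) <= delta]
typical_y = [ys for ys in sequences if abs(sample_H(ys, p_y) - H_y) <= delta]

# Probability that independently drawn sequences land in the jointly typical set (13.40).
prob = 0.0
for xs in typical_x:
    for ys in typical_y:
        if abs(sample_H(list(zip(xs, ys)), p_xy) - H_xy) <= delta:
            prob += math.prod(p_x[x] for x in xs) * math.prod(p_y[y] for y in ys)

bound = 2 ** (-n * (mutual_info - 3 * delta))    # right-hand side of (13.39)
print(prob, "<=", bound)
```

For these parameters only the pairs that differ in exactly one position are jointly typical, so the probability is small even though every individual sequence is marginally typical.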



13.6 Weak Conditional Typicality 

Conditional typicality is a property that we expect to hold for any two random sequences; it
is also a useful tool in the proofs of coding theorems. Suppose two random variables $X$ and
$Y$ have respective alphabets $\mathcal{X}$ and $\mathcal{Y}$ and a joint distribution $p_{X,Y}(x, y)$. We can factor the
joint distribution $p_{X,Y}(x, y)$ as the product of a marginal distribution $p_X(x)$ and a conditional
distribution $p_{Y|X}(y|x)$, and this factoring leads to a particular way that we can think about
generating realizations of the joint random variable. We can consider random variable $Y$ to
be a noisy version of $X$, where we first generate a realization $x$ of the random variable $X$
according to the distribution $p_X(x)$ and follow by generating a realization $y$ of the random
variable $Y$ according to the conditional distribution $p_{Y|X}(y|x)$.

Suppose that we generate $n$ independent realizations of random variable $X$ to obtain the
sequence $x^n = x_1 \cdots x_n$. We then record these values and use the conditional distribution
$p_{Y|X}(y|x)$ $n$ times to generate $n$ independent realizations of random variable $Y$. Let $y^n = y_1 \cdots y_n$
denote the resulting sequence.

Definition 13.6.1 (Conditional Sample Entropy). The conditional sample entropy $\overline{H}(y^n|x^n)$
of two sequences $x^n$ and $y^n$ is

$$\overline{H}(y^n|x^n) = -\frac{1}{n}\log p_{Y^n|X^n}(y^n|x^n), \tag{13.41}$$

where

$$p_{Y^n|X^n}(y^n|x^n) = p_{Y|X}(y_1|x_1) \cdots p_{Y|X}(y_n|x_n). \tag{13.42}$$



Definition 13.6.2 (Conditionally Typical Set). Suppose that a sequence $x^n$ is a typical
sequence in $T_\delta^{X^n}$ and $y^n$ is a typical sequence in $T_\delta^{Y^n}$. The $\delta$-conditionally typical set $T_\delta^{Y^n|x^n}$
consists of all sequences whose conditional sample entropy is $\delta$-close to the true conditional
entropy:

$$T_\delta^{Y^n|x^n} = \{y^n : |\overline{H}(y^n|x^n) - H(Y|X)| \leq \delta\}. \tag{13.43}$$

A different way to define the conditionally typical set is as follows.

Remark 13.6.1 The $\delta$-conditionally typical set $T_\delta^{Y^n|x^n}$ corresponding to the sequence $x^n$
consists of all sequences $y^n$ that are jointly typical with $x^n$:

$$T_\delta^{Y^n|x^n} = \{y^n : (x^n, y^n) \in T_\delta^{X^nY^n}\}. \tag{13.44}$$

The appearance of marginal typicality in the above definitions may again seem somewhat
strange, but it is there because it makes sense that the sequence $y^n$ should be typical if
it is conditionally typical (we thus impose this constraint in the definition). This property
does not necessarily follow from the weak notion of conditional typicality, but it does follow
without any imposed constraints from a stronger notion of conditional typicality that we
give in Section 13.9.



13.6.1 Properties of the Conditionally Typical Set 

The set $T_\delta^{Y^n|x^n}$ of conditionally typical sequences enjoys properties similar to what we have
seen before, and we list them for completeness.

Property 13.6.1 (Unit Probability) The set $T_\delta^{Y^n|x^n}$ asymptotically has probability one
when the sequence $x^n$ is random. So as $n$ becomes large, it is highly likely that random
sequences $Y^n$ and $X^n$ are such that $Y^n$ is a conditionally typical sequence. We formally
state this property as follows:

$$\forall \epsilon > 0 \quad \mathbb{E}_{X^n}\left\{\Pr_{Y^n|X^n}\left\{Y^n \in T_\delta^{Y^n|X^n}\right\}\right\} \geq 1 - \epsilon \quad \text{for sufficiently large } n. \tag{13.45}$$



Property 13.6.2 (Exponentially Small Cardinality) The number $|T_\delta^{Y^n|x^n}|$ of $\delta$-conditionally
typical sequences is exponentially smaller than the total number $|\mathcal{Y}|^n$ of sequences for any
conditional random variable $Y|X$ that is not uniform. We formally state this property as
follows:

$$|T_\delta^{Y^n|x^n}| \leq 2^{n(H(Y|X)+\delta)}. \tag{13.46}$$

We can also lower bound the expected size of the $\delta$-conditionally typical set when $n$ is
sufficiently large and $x^n$ is a random sequence:

$$\forall \epsilon > 0 \quad \mathbb{E}_{X^n}\left\{|T_\delta^{Y^n|X^n}|\right\} \geq (1-\epsilon)\,2^{n(H(Y|X)-\delta)} \quad \text{for sufficiently large } n. \tag{13.47}$$

Property 13.6.3 (Equipartition) The probability of a given $\delta$-conditionally typical sequence
$y^n$ (corresponding to the sequence $x^n$) is approximately uniform:

$$2^{-n(H(Y|X)+\delta)} \leq p_{Y^n|X^n}(y^n|x^n) \leq 2^{-n(H(Y|X)-\delta)}. \tag{13.48}$$

In summary, averaged over realizations of the random variable $X^n$, the conditionally
typical set $T_\delta^{Y^n|X^n}$ has almost all the probability, and its size is exponentially smaller than
the size of the set of all sequences. For each realization of $X^n$, each $\delta$-conditionally typical
sequence has an approximately uniform probability of occurring.

Our last note on the weak conditionally typical set is that there is a subtlety in the
statement of Property 13.6.1 that allows for a relatively straightforward proof. This subtlety
is that we average over the sequence $X^n$ as well, and this allows one to exploit the extra
randomness to simplify the proof. We do not impose such a constraint later on in Section 13.9
where we introduce the notion of a strong conditionally typical sequence. We instead impose
the constraint that the sequence $x^n$ is a strongly typical sequence, and this property is
sufficient to prove that similar properties hold for a strong conditionally typical set.
We now prove the first property. Consider that

$$\mathbb{E}_{X^n}\left\{\Pr_{Y^n|X^n}\left\{Y^n \in T_\delta^{Y^n|X^n}\right\}\right\} = \sum_{x^n \in \mathcal{X}^n} p_{X^n}(x^n) \sum_{y^n \in T_\delta^{Y^n|x^n}} p_{Y^n|X^n}(y^n|x^n) \tag{13.49}$$

$$\geq \sum_{x^n \in T_\delta^{X^n}} p_{X^n}(x^n) \sum_{y^n \in T_\delta^{Y^n|x^n}} p_{Y^n|X^n}(y^n|x^n) \tag{13.50}$$

$$= \sum_{x^n \in T_\delta^{X^n}} \sum_{y^n \in T_\delta^{Y^n|x^n}} p_{X^n,Y^n}(x^n, y^n) \tag{13.51}$$

$$\geq 1 - \epsilon. \tag{13.52}$$

The first equality follows by definition. The first inequality follows because the probability
mass of the set $\mathcal{X}^n$ can only be larger than the probability mass in the typical
set $T_\delta^{X^n}$. The last inequality follows because the conditions $|\overline{H}(x^n) - H(X)| \leq \delta$ and
$|\overline{H}(y^n|x^n) - H(Y|X)| \leq \delta$ imply

$$|\overline{H}(x^n, y^n) - H(X, Y)| \leq \delta' \tag{13.53}$$

for some $\delta'$, for which we then have the law of large numbers to obtain this final bound.
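The implication used in the last step rests on the chain rule holding for sample entropies as well: because $p_{X,Y}(x,y) = p_X(x)\,p_{Y|X}(y|x)$, we have $\overline{H}(x^n, y^n) = \overline{H}(x^n) + \overline{H}(y^n|x^n)$, so the triangle inequality allows the choice $\delta' = 2\delta$. The following Python sketch checks the sample-entropy chain rule on one pair of sequences; the distributions and the two sequences are assumptions chosen purely for illustration.

```python
import math

def sample_entropy(symbols, dist):
    """-(1/n) * sum of log2 dist[s] over the sequence of symbols."""
    return -sum(math.log2(dist[s]) for s in symbols) / len(symbols)

# Hypothetical model: uniform X, and Y equal to X flipped with probability q.
q = 0.1
p_x = {0: 0.5, 1: 0.5}
p_y_given_x = {(y, x): (1 - q if y == x else q) for x in (0, 1) for y in (0, 1)}
p_xy = {(x, y): p_x[x] * p_y_given_x[(y, x)] for x in (0, 1) for y in (0, 1)}

xs = (0, 1, 1, 0, 1, 0, 0, 1)
ys = (0, 1, 0, 0, 1, 0, 0, 1)   # differs from xs in one position

h_x = sample_entropy(xs, p_x)                               # sample entropy of x^n
h_y_given_x = sample_entropy(list(zip(ys, xs)), p_y_given_x)  # (13.41)
h_xy = sample_entropy(list(zip(xs, ys)), p_xy)              # (13.31)

print(h_x, h_y_given_x, h_xy)
print(math.isclose(h_xy, h_x + h_y_given_x))
```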

Exercise 13.6.1 Prove that the last two properties hold for the weak conditionally typical 
set. 

13.7 Strong Typicality 

In the development in the previous sections, we showed how the law of large numbers is the
underpinning method to prove many of the interesting results regarding typical sequences.
These results are satisfactory and provide an intuitive notion of typicality through the idea
of the sample entropy approaching the true entropy for sufficiently long sequences.

It is possible to develop a stronger notion of typicality with a different definition. Instead
of requiring that the sample entropy of a random sequence is close to the true entropy of
a distribution for sufficiently long sequences, strong typicality requires that the empirical
distribution or relative frequency of symbols of a random sequence has a small deviation
from the true probability distribution for sufficiently large sequence length.

We begin with a simple example to help illustrate this stronger notion of typicality.
Suppose that we generate a binary sequence IID according to the distribution $p(0) = 1/4$
and $p(1) = 3/4$. Such a random generation process could lead to the following sequence:

$$0110111010. \tag{13.54}$$

Rather than computing the sample entropy of this sequence and comparing it with the true
entropy, we can count the number of zeros or ones that appear in the sequence and compare
their normalizations with the true distribution of the information source. For the above
example, the number of zeros is equal to 4, and the number of ones (the Hamming weight
of the sequence) is equal to 6:

$$N(0 \mid 0110111010) = 4, \qquad N(1 \mid 0110111010) = 6. \tag{13.55}$$

We can compute the empirical distribution of this sequence by normalizing the above numbers
by the length of the sequence:

$$\frac{1}{10}N(0 \mid 0110111010) = \frac{2}{5}, \qquad \frac{1}{10}N(1 \mid 0110111010) = \frac{3}{5}. \tag{13.56}$$

This empirical distribution deviates from the true distribution by the following amount:

$$\max\left\{\left|\frac{1}{4} - \frac{2}{5}\right|, \left|\frac{3}{4} - \frac{3}{5}\right|\right\} = \frac{3}{20}, \tag{13.57}$$

which is a fairly significant deviation. Though, suppose that the length of the sequence
grows large enough that the law of large numbers comes into play. We would then expect
it to be highly likely that the empirical distribution of a random sequence does not deviate
much from the true distribution, and the law of large numbers again gives a theoretical
underpinning for this intuition. This example gives the essence of strong typicality.
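The counts in (13.55) through (13.57) are easy to reproduce. The short Python sketch below uses exact rational arithmetic; the function name `empirical_distribution` is an illustrative choice, not notation from the text.

```python
from collections import Counter
from fractions import Fraction

def empirical_distribution(seq):
    """Relative frequency N(x | x^n) / n of each symbol appearing in seq."""
    counts = Counter(seq)
    return {symbol: Fraction(count, len(seq)) for symbol, count in counts.items()}

p = {"0": Fraction(1, 4), "1": Fraction(3, 4)}   # true distribution from the example
seq = "0110111010"                               # the sequence in (13.54)

t = empirical_distribution(seq)
deviation = max(abs(t.get(symbol, Fraction(0)) - p[symbol]) for symbol in p)
print(t["0"], t["1"], deviation)                 # 2/5 3/5 3/20
```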

We wish to highlight another important aspect of the above example. The particular
sequence in (13.54) has a Hamming weight of six, but this sequence is not the only one
with this Hamming weight. By a simple counting argument, there are $\binom{10}{6} - 1 = 209$ other
sequences with the same length and Hamming weight. That is, all these other sequences have
the same empirical distribution and thus have the same deviation from the true distribution
as the original sequence in (13.54). We say that all these sequences are in the same "type
class," which simply means that they have the same empirical distribution. The type class
is thus an equivalence class on sequences where the equivalence relation is the empirical
distribution of the sequence.

We mention a few interesting properties of the type class before giving more formal
definitions. We can partition the set of all possible sequences according to type classes.
Consider that the set of all binary sequences of length ten has size $2^{10}$. There is one sequence
with all zeros, $\binom{10}{1}$ sequences with Hamming weight one, $\binom{10}{2}$ sequences with Hamming weight
two, etc. The binomial theorem guarantees that the total number of sequences is equal to
the number of sequences in all of the type classes:

$$2^{10} = \sum_{i=0}^{10}\binom{10}{i}. \tag{13.58}$$

Suppose now that we generate ten IID realizations of the Bernoulli distribution $p(0) = 1/4$
and $p(1) = 3/4$. Without knowing anything else, our best description of the distribution of
the random sequence is

$$p(x_1, \ldots, x_{10}) = p(x_1) \cdots p(x_{10}), \tag{13.59}$$

where $x_1, \ldots, x_{10}$ are different realizations of the binary random variable. But suppose that
a third party tells us the Hamming weight $w_0$ of the generated sequence. This information
allows us to update our knowledge of the distribution of the sequence, and we can say that any
sequence with Hamming weight not equal to $w_0$ has zero probability. All the sequences with
the same Hamming weight have the same probability because we generated the sequence
in an IID way, and each sequence with Hamming weight $w_0$ has a uniform distribution after
renormalizing. Thus, conditioned on the Hamming weight $w_0$, our best description of the
distribution of the random sequence is

$$p(x_1, \ldots, x_{10} \mid w_0) = \begin{cases} 0 & : w(x_1, \ldots, x_{10}) \neq w_0 \\ \binom{10}{w_0}^{-1} & : w(x_1, \ldots, x_{10}) = w_0 \end{cases}, \tag{13.60}$$

where $w$ is a function that gives the Hamming weight of a binary sequence. This property
has important consequences for asymptotic information processing because it gives us a
way to extract uniform randomness from an IID distribution, and we later see that it has
applications in several quantum information processing protocols as well.
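The claim in (13.60) can be verified by direct enumeration. The following Python sketch conditions on Hamming weight $w_0 = 6$ (the weight of the example sequence) and checks that every sequence in the type class receives the same conditional probability; the helper name `sequence_probability` is an assumption made for illustration.

```python
import itertools
import math
from fractions import Fraction

p = {0: Fraction(1, 4), 1: Fraction(3, 4)}       # Bernoulli source from the example
n, w0 = 10, 6

def sequence_probability(seq):
    """IID probability p(x_1) ... p(x_n) of a binary sequence, as in (13.59)."""
    result = Fraction(1)
    for symbol in seq:
        result *= p[symbol]
    return result

# All length-10 sequences with Hamming weight w0, with their unconditional probabilities.
type_class = {seq: sequence_probability(seq)
              for seq in itertools.product((0, 1), repeat=n) if sum(seq) == w0}

# Renormalize by the probability of observing weight w0.
total = sum(type_class.values())
conditional = {seq: prob / total for seq, prob in type_class.items()}

print(len(type_class), math.comb(n, w0))         # both are 210
print(set(conditional.values()))                 # a single value: 1/210
```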

13.7.1 Types and Strong Typicality 

We now formally develop the notion of a type and strong typicality. Let $x^n$ denote a sequence
$x_1x_2 \cdots x_n$, where each $x_i$ belongs to the alphabet $\mathcal{X}$. Let $|\mathcal{X}|$ be the cardinality of $\mathcal{X}$. Let
$N(x|x^n)$ be the number of occurrences of the symbol $x \in \mathcal{X}$ in the sequence $x^n$.

Definition 13.7.1 (Type). The type or empirical distribution $t_{x^n}$ of a sequence $x^n$ is a
probability mass function whose elements are $t_{x^n}(x)$ where

$$t_{x^n}(x) = \frac{1}{n}N(x|x^n). \tag{13.61}$$

Definition 13.7.2 (Strongly Typical Set). The $\delta$-strongly typical set $T_\delta^{X^n}$ is the set of all
sequences with an empirical distribution $\frac{1}{n}N(x|x^n)$ that has maximum deviation $\delta$ from the
true distribution $p_X(x)$. Furthermore, the empirical distribution $\frac{1}{n}N(x|x^n)$ of any sequence
in $T_\delta^{X^n}$ vanishes for any letter $x$ for which $p_X(x) = 0$:

$$T_\delta^{X^n} = \left\{x^n : \forall x \in \mathcal{X},\ \left|\frac{1}{n}N(x|x^n) - p_X(x)\right| \leq \delta \text{ if } p_X(x) > 0, \text{ else } \frac{1}{n}N(x|x^n) = 0\right\}. \tag{13.62}$$

The extra condition where $\frac{1}{n}N(x|x^n) = 0$ when $p_X(x) = 0$ is a somewhat technical
condition, nevertheless intuitive, that is necessary to prove the three desired properties for
the strongly typical set. Also, we are using the same notation $T_\delta^{X^n}$ to indicate both the
weakly and strongly typical set, but which one is appropriate should be clear from context,
or we will explicitly indicate which one we are using.

The notion of type class becomes useful for us in our later developments; it is simply
a way for grouping together all the sequences with the same empirical distribution. Its
most important use is as a way for obtaining a uniform distribution from an arbitrary IID
distribution (recall that we can do this by conditioning on a particular type).

Definition 13.7.3 (Type Class). Let $T_t^{X^n}$ denote the type class of a particular type $t$. The
type class $T_t^{X^n}$ is the set of all sequences with length $n$ and type $t$:

$$T_t^{X^n} = \{x^n \in \mathcal{X}^n : t_{x^n} = t\}. \tag{13.63}$$

Property 13.7.1 (Bound on the Number of Types) The number of types for a given
sequence length $n$ containing symbols from an alphabet $\mathcal{X}$ is exactly equal to

$$\binom{n + |\mathcal{X}| - 1}{|\mathcal{X}| - 1}. \tag{13.64}$$

A good upper bound on the number of types is

$$(n+1)^{|\mathcal{X}|}. \tag{13.65}$$

Proof. The number of types is equivalent to the number of ways that the symbols in a
sequence of length $n$ can form $|\mathcal{X}|$ distinct groups. Consider the following visual aid:

[Illustration: a row of $n$ dots split into $|\mathcal{X}|$ groups by $|\mathcal{X}| - 1$ vertical bars, e.g. • • • | • • | • • • •] (13.66)

We can think of the number of types as the number of different ways of arranging $|\mathcal{X}| - 1$
vertical bars to group the $n$ dots into $|\mathcal{X}|$ distinct groups. The upper bound follows from
a simple argument. The number of types is the number of different ways that $|\mathcal{X}|$ nonnegative
integers can sum to $n$. Overestimating the count, we can choose the first number in $n + 1$
different ways (it can be any number from 0 to $n$), and we can choose each of the $|\mathcal{X}| - 1$ other
numbers in $n + 1$ different ways. Multiplying all of these possibilities together gives an upper
bound $(n+1)^{|\mathcal{X}|}$ on the number of types. This bound illustrates that the number of types
is only polynomial in the length $n$ of the sequence (compare with the total number $|\mathcal{X}|^n$ of
sequences of length $n$ being exponential in the length of the sequence). □

Definition 13.7.4 (Typical Type). Let $p_X(x)$ denote the true probability distribution of
symbols $x$ in the alphabet $\mathcal{X}$. For $\delta > 0$, let $\tau_\delta$ denote the set of all typical types that have
maximum deviation $\delta$ from the true distribution $p_X(x)$:

$$\tau_\delta = \{t : \forall x \in \mathcal{X},\ |t(x) - p_X(x)| \leq \delta \text{ if } p_X(x) > 0, \text{ else } t(x) = 0\}. \tag{13.67}$$

We can then equivalently define the set of strongly $\delta$-typical sequences of length $n$ as a
union over all the type classes of the typical types in $\tau_\delta$:

$$T_\delta^{X^n} = \bigcup_{t \in \tau_\delta} T_t^{X^n}. \tag{13.68}$$
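The counting of types and the selection of typical types can be illustrated with a short Python sketch. It enumerates types with the dots-and-bars construction from the proof of Property 13.7.1 and then filters them with the condition in (13.67). The three-letter alphabet, the distribution (with one zero-probability letter), $n = 10$, and $\delta = 1/10$ are assumptions chosen only for illustration.

```python
import itertools
import math
from fractions import Fraction

def types(n, alphabet):
    """Yield every type of length-n sequences over the alphabet as {symbol: frequency}."""
    k = len(alphabet)
    # Dots and bars: place k - 1 bars among n + k - 1 slots; the gaps are the counts.
    for bars in itertools.combinations(range(n + k - 1), k - 1):
        edges = (-1,) + bars + (n + k - 1,)
        counts = [edges[i + 1] - edges[i] - 1 for i in range(k)]
        yield {a: Fraction(c, n) for a, c in zip(alphabet, counts)}

def is_typical_type(t, p, delta):
    """Condition (13.67): delta-close where p > 0, and exactly zero where p = 0."""
    return all(abs(t[a] - p[a]) <= delta if p[a] > 0 else t[a] == 0 for a in p)

# Hypothetical alphabet and source; the letter "c" never occurs.
alphabet = ("a", "b", "c")
p = {"a": Fraction(1, 2), "b": Fraction(1, 2), "c": Fraction(0)}
n, delta = 10, Fraction(1, 10)

all_types = list(types(n, alphabet))
typical_types = [t for t in all_types if is_typical_type(t, p, delta)]

print(len(all_types), math.comb(n + len(alphabet) - 1, len(alphabet) - 1))  # exact count
print((n + 1) ** len(alphabet))                                            # upper bound
print(len(typical_types))
```

With these parameters there are 66 types, well below the bound $11^3 = 1331$, and the typical types are those with four, five, or six occurrences of "a" and none of "c".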

13.7.2 Properties of the Strongly Typical Set 

The strongly typical set enjoys many useful properties (similar to the weakly typical set). 

Property 13.7.2 (Unit Probability) The strongly typical set asymptotically has probability
one. So as $n$ becomes large, it is highly likely that a source emits a strongly typical
sequence. We formally state this property as follows:

$$\forall \epsilon > 0 \quad \Pr\{X^n \in T_\delta^{X^n}\} \geq 1 - \epsilon \quad \text{for sufficiently large } n. \tag{13.69}$$

Property 13.7.3 (Exponentially Small Cardinality) The number $|T_\delta^{X^n}|$ of $\delta$-typical
sequences is exponentially smaller than the total number $|\mathcal{X}|^n$ of sequences for most random
variables $X$. We formally state this property as follows:

$$|T_\delta^{X^n}| \leq 2^{n(H(X)+c\delta)}, \tag{13.70}$$

where $c$ is some positive constant. We can also lower bound the size of the $\delta$-typical set when
$n$ is sufficiently large:

$$\forall \epsilon > 0 \quad |T_\delta^{X^n}| \geq (1-\epsilon)\,2^{n(H(X)-c\delta)} \quad \text{for sufficiently large } n \text{ and some constant } c. \tag{13.71}$$

Property 13.7.4 (Equipartition) The probability of a given $\delta$-typical sequence $x^n$ occurring
is approximately uniform:

$$2^{-n(H(X)+c\delta)} \leq p_{X^n}(x^n) \leq 2^{-n(H(X)-c\delta)}. \tag{13.72}$$

This last property of strong typicality demonstrates that it implies weak typicality up to
an irrelevant constant $c$.
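For a binary source, the probability of the strongly typical set can be computed exactly from the binomial distribution, which makes the unit probability property (13.69) easy to observe. The sketch below is an illustration under assumptions: the source $p(1) = 3/4$, the tolerance $\delta = 1/20$, the block lengths, and the function name are all illustrative choices. For a binary alphabet the deviation on the symbol 0 equals the deviation on the symbol 1, so a single check suffices.

```python
import math
from fractions import Fraction

def prob_strongly_typical(n, p1, delta):
    """Exact Pr{ |N(1|X^n)/n - p1| <= delta } for an IID binary source with 0 < p1 < 1."""
    total = Fraction(0)
    for k in range(n + 1):                       # k = number of ones in the sequence
        if abs(Fraction(k, n) - p1) <= delta:
            total += math.comb(n, k) * p1 ** k * (1 - p1) ** (n - k)
    return total

p1, delta = Fraction(3, 4), Fraction(1, 20)
probs = [prob_strongly_typical(n, p1, delta) for n in (10, 100, 1000)]
for n, prob in zip((10, 100, 1000), probs):
    print(n, float(prob))
```

The printed probabilities increase toward one as $n$ grows, while for $n = 10$ only the sequences with seven or eight ones are strongly typical.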

13.7.3 Proofs of the Properties of the Strongly Typical Set 



Proof of the Unit Probability Property (Property 13.7.2). The proof proceeds similarly to the
proof of the unit probability property for the weakly typical set. The law of large numbers
states that the sample mean of a random sequence converges in probability to the expectation
of the random variable from which we generate the sequence. So consider a sequence of IID
random variables $X_1, \ldots, X_n$ where each random variable in the sequence has expectation
$\mu$. The sample average of this sequence is as follows:

$$\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_i. \tag{13.73}$$

The precise statement of the weak law of large numbers is that

$$\forall \epsilon, \delta > 0 \quad \exists n_0 : \forall n > n_0 \quad \Pr\{|\overline{X} - \mu| > \delta\} < \epsilon. \tag{13.74}$$



We can now consider the indicator random variables $I(X_1 = a), \ldots, I(X_n = a)$. The sample
mean of a random sequence of indicator variables is equal to the empirical distribution
$N(a|X^n)/n$:

$$\frac{1}{n} \sum_{i=1}^{n} I(X_i = a) = \frac{1}{n} N(a|X^n), \tag{13.75}$$

and the expectation of the indicator random variable $I(X = a)$ is equal to the probability of
the symbol $a$:

$$\mathbb{E}_X\{ I(X = a) \} = p_X(a). \tag{13.76}$$

Also, any random sequence $X^n$ has probability zero if one of its symbols $x_i$ is such that
$p_X(x_i) = 0$. Thus, the probability that $\frac{1}{n} N(a|X^n) = 0$ is equal to one whenever $p_X(a) = 0$:

$$\Pr\left\{ \frac{1}{n} N(a|X^n) = 0 : p_X(a) = 0 \right\} = 1, \tag{13.77}$$



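and we can consider the cases when $p_X(a) > 0$. We apply the law of large numbers to find
that

$$\forall \epsilon, \delta > 0 \ \ \exists n_{0,a} : \forall n > n_{0,a} \quad \Pr\left\{ \left| \frac{1}{n} N(a|X^n) - p_X(a) \right| > \delta \right\} < \frac{\epsilon}{|\mathcal{X}|}. \tag{13.78}$$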



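Choosing $n_0 = \max_{a \in \mathcal{X}} \{ n_{0,a} \}$, the following condition holds by the union bound of probability
theory:

$$\forall \epsilon, \delta > 0 \ \ \exists n_0 : \forall n > n_0 \quad \Pr\left\{ \bigcup_{a \in \mathcal{X}} \left| \frac{1}{n} N(a|X^n) - p_X(a) \right| > \delta \right\} \le \sum_{a \in \mathcal{X}} \Pr\left\{ \left| \frac{1}{n} N(a|X^n) - p_X(a) \right| > \delta \right\} < \epsilon. \tag{13.79}$$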



Thus it holds that the complement of the above event on the left holds with high probability:

$$\forall \epsilon, \delta > 0 \ \ \exists n_0 : \forall n > n_0 \quad \Pr\left\{ \forall a \in \mathcal{X},\ \left| \frac{1}{n} N(a|X^n) - p_X(a) \right| \le \delta \right\} > 1 - \epsilon. \tag{13.80}$$

The event $\{ \forall a \in \mathcal{X},\ |\frac{1}{n} N(a|X^n) - p_X(a)| \le \delta \}$ is the condition for a random sequence $X^n$
to be in the strongly typical set $T_\delta^{X^n}$, and the probability of this event goes to one as $n$
becomes sufficiently large. □
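The convergence in the unit probability property can be observed exactly for a binary source. The following sketch assumes an IID Bernoulli source with hypothetical parameters $p = 3/10$ and $\delta = 1/20$. For such a source a sequence with $k$ ones is strongly typical exactly when $|k/n - p| \le \delta$, so the probability of the typical set is a sum of binomial probabilities.

```python
from fractions import Fraction
from math import exp, lgamma, log

def prob_strongly_typical(n, p, delta):
    """Exact Pr{X^n in T_delta} for an IID Bernoulli(p) source with 0 < p < 1.

    A binary sequence with k ones is strongly typical iff |k/n - p| <= delta,
    because the deviation for the symbol 0 is then the same as for the symbol 1.
    """
    total = 0.0
    for k in range(n + 1):
        if abs(Fraction(k, n) - p) <= delta:
            # Binomial probability computed in the log domain to avoid overflow.
            log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
                       + k * log(p) + (n - k) * log(1 - p))
            total += exp(log_pmf)
    return total

p, delta = Fraction(3, 10), Fraction(1, 20)   # hypothetical source and deviation
probabilities = {n: prob_strongly_typical(n, p, delta) for n in (10, 100, 1000, 10000)}
for n, prob in probabilities.items():
    print(n, round(prob, 4))
```

The typicality test uses exact fractions, while the binomial probabilities themselves are floating-point approximations.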



Proof of the Exponentially Small Cardinality Property (Property 13.7.3). By the proof of
Property 13.7.4 (proved below), we know that the following relation holds for any sequence $x^n$ in
the strongly typical set:

$$2^{-n(H(X) + c\delta)} \le p_{X^n}(x^n) \le 2^{-n(H(X) - c\delta)}, \tag{13.81}$$

where $c$ is some constant that we define when we prove Property 13.7.4. Summing over all
sequences in the typical set, we get the following inequalities:

$$\sum_{x^n \in T_\delta^{X^n}} 2^{-n(H(X) + c\delta)} \le \Pr\{ X^n \in T_\delta^{X^n} \} \le \sum_{x^n \in T_\delta^{X^n}} 2^{-n(H(X) - c\delta)}, \tag{13.82}$$

$$\Rightarrow \ 2^{-n(H(X) + c\delta)} |T_\delta^{X^n}| \le \Pr\{ X^n \in T_\delta^{X^n} \} \le 2^{-n(H(X) - c\delta)} |T_\delta^{X^n}|. \tag{13.83}$$

By the unit probability property of the strongly typical set, we know that the following
relation holds for sufficiently large $n$:

$$1 \ge \Pr\{ X^n \in T_\delta^{X^n} \} \ge 1 - \epsilon. \tag{13.84}$$

Then the following inequalities result by combining the above inequalities:

$$2^{n(H(X) - c\delta)} (1 - \epsilon) \le |T_\delta^{X^n}| \le 2^{n(H(X) + c\delta)}. \tag{13.85}$$

□



Proof of the Equipartition Property (Property 13.7.4). The following relation holds from the
IID property of the distribution $p_{X^n}(x^n)$ and because the sequence $x^n$ is strongly typical:

$$p_{X^n}(x^n) = \prod_{x \in \mathcal{X}^+} p_X(x)^{N(x|x^n)}, \tag{13.86}$$

where $\mathcal{X}^+$ denotes all the letters $x$ in $\mathcal{X}$ with $p_X(x) > 0$. (The fact that the sequence $x^n$ is
strongly typical according to Definition 13.7.2 allows us to employ this modified alphabet.)
Take the logarithm of the above expression:

$$\log(p_{X^n}(x^n)) = \sum_{x \in \mathcal{X}^+} N(x|x^n) \log(p_X(x)). \tag{13.87}$$

Multiply both sides by $-\frac{1}{n}$:

$$-\frac{1}{n} \log(p_{X^n}(x^n)) = -\sum_{x \in \mathcal{X}^+} \frac{1}{n} N(x|x^n) \log(p_X(x)). \tag{13.88}$$

The following relation holds because the sequence $x^n$ is strongly typical:

$$\forall x \in \mathcal{X}^+ : \ \left| \frac{1}{n} N(x|x^n) - p_X(x) \right| \le \delta, \tag{13.89}$$

and it implies that

$$\forall x \in \mathcal{X}^+ : \ -\delta + p_X(x) \le \frac{1}{n} N(x|x^n) \le \delta + p_X(x). \tag{13.90}$$

Now multiply (13.90) by $-\log(p_X(x)) \ge 0$, sum over all letters in the alphabet $\mathcal{X}^+$, and
apply the substitution in (13.88). This procedure gives the following set of inequalities:

$$-\sum_{x \in \mathcal{X}^+} (-\delta + p_X(x)) \log(p_X(x)) \le -\frac{1}{n} \log(p_{X^n}(x^n)) \le -\sum_{x \in \mathcal{X}^+} (\delta + p_X(x)) \log(p_X(x)), \tag{13.91}$$

$$\Rightarrow \ -c\delta + H(X) \le -\frac{1}{n} \log(p_{X^n}(x^n)) \le c\delta + H(X), \tag{13.92}$$

$$\Rightarrow \ 2^{-n(H(X) + c\delta)} \le p_{X^n}(x^n) \le 2^{-n(H(X) - c\delta)}, \tag{13.93}$$

where

$$c \equiv -\sum_{x \in \mathcal{X}^+} \log(p_X(x)) \ge 0. \tag{13.94}$$

It now becomes apparent why we require the technical condition in the definition of strong
typicality (Definition 13.7.2). Were it not there, then the constant $c$ would not be finite, and
we would not be able to obtain a reasonable bound on the probability of a strongly typical
sequence. □



13.7.4 Cardinality of a Typical Type Class 

Recall that a typical type class is the set of all sequences with the same empirical
distribution, and the empirical distribution happens to have maximum deviation $\delta$ from the true
distribution. It might seem that the size $|T_t^{X^n}|$ of a typical type class $T_t^{X^n}$ should be smaller
than the size of the strongly typical set. But the following property overrides this intuition
and shows that a given typical type class $T_t^{X^n}$ has almost as many sequences in it as the
strongly typical set $T_\delta^{X^n}$ for sufficiently large $n$.

Property 13.7.5 (Minimal Cardinality of a Typical Type Class) For $t \in \tau_\delta$ and for
sufficiently large $n$, the size $|T_t^{X^n}|$ of the typical type class $T_t^{X^n}$ is lower bounded as follows:

$$|T_t^{X^n}| \ge \frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{n[H(X) - \eta(|\mathcal{X}|\delta)]} = 2^{n[H(X) - \eta(|\mathcal{X}|\delta) - |\mathcal{X}| \frac{1}{n} \log(n+1)]}, \tag{13.95}$$

where $\eta(\delta)$ is some function such that $\eta(\delta) \to 0$ as $\delta \to 0$. Thus, a typical type class is of
size roughly $2^{nH(X)}$ when $n \to \infty$ and $\delta \to 0$ (it is about as large as the typical set when $n$
becomes large).

Proof. We first show that if $X_1, \ldots, X_n$ are random variables drawn IID from a distribution
$q(x)$, then the probability $q^n(x^n)$ of a particular sequence $x^n$ depends only on its type:

$$q^n(x^n) = 2^{-n(H(t_{x^n}) + D(t_{x^n} \| q))}, \tag{13.96}$$

where $D(t_{x^n} \| q)$ is the relative entropy between $t_{x^n}$ and $q$. Consider the following chain of
equalities:

$$q^n(x^n) = \prod_{i=1}^{n} q(x_i) = \prod_{x \in \mathcal{X}} q(x)^{N(x|x^n)} = \prod_{x \in \mathcal{X}} q(x)^{n t_{x^n}(x)} \tag{13.97}$$

$$= \prod_{x \in \mathcal{X}} 2^{n t_{x^n}(x) \log q(x)} = 2^{n \sum_{x \in \mathcal{X}} t_{x^n}(x) \log q(x)} \tag{13.98}$$

$$= 2^{n \sum_{x \in \mathcal{X}} [\, t_{x^n}(x) \log q(x) - t_{x^n}(x) \log t_{x^n}(x) + t_{x^n}(x) \log t_{x^n}(x) \,]} \tag{13.99}$$

$$= 2^{-n(D(t_{x^n} \| q) + H(t_{x^n}))}. \tag{13.100}$$

It then follows that the probability of the sequence $x^n$ is $2^{-nH(t_{x^n})}$ if the distribution $q(x) =
t_{x^n}(x)$. Now consider that each type class $T_t^{X^n}$ has size

$$\binom{n}{n t_{x^n}(x_1), \ldots, n t_{x^n}(x_{|\mathcal{X}|})}, \tag{13.101}$$

where the distribution $t = (t_{x^n}(x_1), \ldots, t_{x^n}(x_{|\mathcal{X}|}))$ and the letters of $\mathcal{X}$ are $x_1, \ldots, x_{|\mathcal{X}|}$.
This result follows because the size of a type class is just the number of ways of arranging
$n t_{x^n}(x_1), \ldots, n t_{x^n}(x_{|\mathcal{X}|})$ in a sequence of length $n$. We now prove that the type class $T_t^{X^n}$
has the highest probability among all type classes when the probability distribution is $t$:

$$t^n(T_t^{X^n}) \ge t^n(T_{t'}^{X^n}) \quad \text{for all } t' \in \mathcal{P}_n, \tag{13.102}$$

where $t^n$ is the IID distribution induced by the type $t$ and $\mathcal{P}_n$ is the set of all types. Consider
the following equalities:


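$$\frac{t^n(T_t^{X^n})}{t^n(T_{t'}^{X^n})} = \frac{|T_t^{X^n}| \prod_{x \in \mathcal{X}} t_{x^n}(x)^{n t_{x^n}(x)}}{|T_{t'}^{X^n}| \prod_{x \in \mathcal{X}} t_{x^n}(x)^{n t'_{x^n}(x)}} \tag{13.103}$$

$$= \frac{\binom{n}{n t_{x^n}(x_1), \ldots, n t_{x^n}(x_{|\mathcal{X}|})} \prod_{x \in \mathcal{X}} t_{x^n}(x)^{n t_{x^n}(x)}}{\binom{n}{n t'_{x^n}(x_1), \ldots, n t'_{x^n}(x_{|\mathcal{X}|})} \prod_{x \in \mathcal{X}} t_{x^n}(x)^{n t'_{x^n}(x)}} \tag{13.104}$$

$$= \prod_{x \in \mathcal{X}} \frac{(n t'_{x^n}(x))!}{(n t_{x^n}(x))!}\, t_{x^n}(x)^{n(t_{x^n}(x) - t'_{x^n}(x))}. \tag{13.105}$$

Now apply the bound $\frac{m!}{n!} \ge n^{m-n}$ (that holds for any positive integers $m$ and $n$) to get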



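$$\frac{t^n(T_t^{X^n})}{t^n(T_{t'}^{X^n})} \ge \prod_{x \in \mathcal{X}} [n t_{x^n}(x)]^{n(t'_{x^n}(x) - t_{x^n}(x))}\, t_{x^n}(x)^{n(t_{x^n}(x) - t'_{x^n}(x))} \tag{13.106}$$

$$= \prod_{x \in \mathcal{X}} n^{n(t'_{x^n}(x) - t_{x^n}(x))} \tag{13.107}$$

$$= n^{n \sum_{x \in \mathcal{X}} [\, t'_{x^n}(x) - t_{x^n}(x) \,]} \tag{13.108}$$

$$= n^{n(1 - 1)} \tag{13.109}$$

$$= 1. \tag{13.110}$$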



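Thus, it holds that $t^n(T_t^{X^n}) \ge t^n(T_{t'}^{X^n})$ for all $t'$. Now we are close to obtaining the desired
bound in Property 13.7.5. Consider the following chain of inequalities:

$$1 = \sum_{t' \in \mathcal{P}_n} t^n(T_{t'}^{X^n}) \le \sum_{t' \in \mathcal{P}_n} \max_{t'} t^n(T_{t'}^{X^n}) = \sum_{t' \in \mathcal{P}_n} t^n(T_t^{X^n}) \tag{13.111}$$

$$\le (n+1)^{|\mathcal{X}|}\, t^n(T_t^{X^n}) = (n+1)^{|\mathcal{X}|} \sum_{x^n \in T_t^{X^n}} t^n(x^n) \tag{13.112}$$

$$= (n+1)^{|\mathcal{X}|} \sum_{x^n \in T_t^{X^n}} 2^{-nH(t)} \tag{13.113}$$

$$= (n+1)^{|\mathcal{X}|}\, 2^{-nH(t)}\, |T_t^{X^n}|. \tag{13.114}$$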


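Recall that $t$ is a typical type, implying that $|t(x) - p(x)| \le \delta$ for all $x$. This then implies
that the variational distance between the distributions is small:

$$\sum_{x} |t(x) - p(x)| \le |\mathcal{X}| \delta. \tag{13.115}$$

We can apply Fannes' inequality for continuity of entropy (Theorem 11.9.5) to get a bound
on the difference of entropies:

$$|H(t) - H(X)| \le 2 |\mathcal{X}| \delta \log |\mathcal{X}| + 2 H_2(|\mathcal{X}| \delta). \tag{13.116}$$

Rearranging (13.114) gives $|T_t^{X^n}| \ge (n+1)^{-|\mathcal{X}|}\, 2^{nH(t)}$, and the desired bound then follows
from (13.116) with $\eta(|\mathcal{X}|\delta) \equiv 2 |\mathcal{X}| \delta \log |\mathcal{X}| + 2 H_2(|\mathcal{X}| \delta)$. □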
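The quantities in this proof are easy to evaluate for a concrete type. The sketch below uses a hypothetical type with counts $(4, 2, 2)$ and $n = 8$, and a hypothetical uniform distribution $q$. It computes the type class size as a multinomial coefficient, compares it with $(n+1)^{-|\mathcal{X}|} 2^{nH(t)}$ (the rearranged form of the final line of the chain of inequalities) and with the upper bound $2^{nH(t)}$, and checks numerically that the probability of a sequence under $q^n$ depends only on its type.

```python
from fractions import Fraction
from math import factorial, log2

def type_class_size(counts):
    """Multinomial coefficient: number of sequences with the given symbol counts."""
    size = factorial(sum(counts))
    for count in counts:
        size //= factorial(count)
    return size

def entropy(t):
    """H(t) in bits."""
    return -sum(float(tx) * log2(tx) for tx in t if tx > 0)

def relative_entropy(t, q):
    """D(t||q) in bits; assumes q(x) > 0 wherever t(x) > 0."""
    return sum(float(tx) * log2(tx / qx) for tx, qx in zip(t, q) if tx > 0)

counts = (4, 2, 2)                                      # hypothetical type, n = 8
n = sum(counts)
t = tuple(Fraction(c, n) for c in counts)
q = (Fraction(1, 3), Fraction(1, 3), Fraction(1, 3))    # hypothetical IID distribution

size = type_class_size(counts)
lower = 2 ** (n * entropy(t)) / (n + 1) ** len(counts)  # (n+1)^{-|X|} 2^{nH(t)}
upper = 2 ** (n * entropy(t))                           # 2^{nH(t)}

# Probability of one sequence of type t under q^n, computed two ways.
direct = 1.0
for count, qx in zip(counts, q):
    direct *= float(qx) ** count
via_type = 2 ** (-n * (entropy(t) + relative_entropy(t, q)))

print(size, lower, upper, direct, via_type)
```

At this block length the polynomial factor $(n+1)^{|\mathcal{X}|}$ is still large, so the lower bound is loose; it only becomes negligible on the exponential scale as $n$ grows.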



Exercise 13.7.1 Prove that $2^{nH(t)}$ is an upper bound on the number of sequences $x^n$ of
type $t$:

$$|T_t^{X^n}| \le 2^{nH(t)}. \tag{13.117}$$

Use this bound and (13.100) to prove the following upper bound on the probability of a type
class where each sequence is generated IID according to a distribution $q(x)$:

$$\Pr\{ T_t^{X^n} \} \le 2^{-nD(t \| q)}. \tag{13.118}$$

13.8 Strong Joint Typicality



It is possible to extend the above notions of strong typicality to jointly typical sequences. In
a marked difference with the weakly typical case, we can show that strong joint typicality
implies marginal typicality. Thus there is no need to impose this constraint in the definition.
Let $N(x, y|x^n, y^n)$ be the number of occurrences of the symbol pair $x \in \mathcal{X}$, $y \in \mathcal{Y}$ in the
respective sequences $x^n$ and $y^n$. The type or empirical distribution $t_{x^n y^n}$ of sequences $x^n$ and
$y^n$ is a probability mass function whose elements are $t_{x^n y^n}(x, y)$ where

$$t_{x^n y^n}(x, y) \equiv \frac{1}{n} N(x, y|x^n, y^n). \tag{13.119}$$

Definition 13.8.1 (Strong Jointly Typical Sequence). Two sequences $x^n, y^n$ are $\delta$-strongly-
jointly-typical if their empirical distribution has maximum deviation $\delta$ from the true
distribution and vanishes for any two symbols $x$ and $y$ for which $p_{X,Y}(x, y) = 0$.

Definition 13.8.2 (Strong Jointly Typical Set). The $\delta$-jointly typical set $T_\delta^{X^n Y^n}$ is the set
of all $\delta$-jointly typical sequences:

$$T_\delta^{X^n Y^n} \equiv \left\{ x^n, y^n : \forall (x, y) \in \mathcal{X} \times \mathcal{Y} \ \ \begin{array}{ll} \left| \frac{1}{n} N(x, y|x^n, y^n) - p_{X,Y}(x, y) \right| \le \delta & \text{if } p_{X,Y}(x, y) > 0 \\ \frac{1}{n} N(x, y|x^n, y^n) = 0 & \text{otherwise} \end{array} \right\}. \tag{13.120}$$



It follows from the above definitions that strong joint typicality implies marginal typicality
for both sequences $x^n$ and $y^n$. We leave the proof as the following exercise.

Exercise 13.8.1 Prove that strong joint typicality implies marginal typicality for either the
sequence $x^n$ or $y^n$.

13.8.1 Properties of the Strong Jointly Typical Set 

The set $T_\delta^{X^n Y^n}$ of strong jointly typical sequences enjoys properties similar to what we have
seen before.

Property 13.8.1 (Unit Probability) The strong jointly typical set $T_\delta^{X^n Y^n}$ asymptotically
has probability one. So as $n$ becomes large, it is highly likely that a source emits a strong
jointly typical sequence. We formally state this property as follows:

$$\forall \epsilon > 0 \quad \Pr\{ X^n Y^n \in T_\delta^{X^n Y^n} \} \ge 1 - \epsilon \quad \text{for sufficiently large } n. \tag{13.121}$$

Property 13.8.2 (Exponentially Small Cardinality) The number $|T_\delta^{X^n Y^n}|$ of $\delta$-jointly
typical sequences is exponentially smaller than the total number $(|\mathcal{X}||\mathcal{Y}|)^n$ of sequences for
any joint random variable $(X, Y)$ that is not uniform. We formally state this property as
follows:

$$|T_\delta^{X^n Y^n}| \le 2^{n(H(X,Y) + c\delta)}, \tag{13.122}$$

where $c$ is a constant. We can also lower bound the size of the $\delta$-jointly typical set when $n$
is sufficiently large:

$$\forall \epsilon > 0 \quad |T_\delta^{X^n Y^n}| \ge (1 - \epsilon)\, 2^{n(H(X,Y) - c\delta)} \quad \text{for sufficiently large } n. \tag{13.123}$$

Property 13.8.3 (Equipartition) The probability of a given $\delta$-jointly typical sequence
$x^n y^n$ occurring is approximately uniform:

$$2^{-n(H(X,Y) + c\delta)} \le p_{X^n, Y^n}(x^n, y^n) \le 2^{-n(H(X,Y) - c\delta)}. \tag{13.124}$$

Property 13.8.4 (Probability of Strong Joint Typicality) Consider two independent
random variables $\tilde{X}^n$ and $\tilde{Y}^n$ whose respective probability density functions $p_{\tilde{X}^n}(x^n)$ and
$p_{\tilde{Y}^n}(y^n)$ are equal to the marginal densities of the joint density $p_{X^n, Y^n}(x^n, y^n)$:

$$(\tilde{X}^n, \tilde{Y}^n) \sim p_{X^n}(x^n)\, p_{Y^n}(y^n). \tag{13.125}$$

Then we can bound the probability that two random sequences $\tilde{X}^n$ and $\tilde{Y}^n$ are in the jointly
typical set $T_\delta^{X^n Y^n}$:

$$\Pr\{ (\tilde{X}^n, \tilde{Y}^n) \in T_\delta^{X^n Y^n} \} \le 2^{-n(I(X;Y) - 3c\delta)}. \tag{13.126}$$

The proofs of the first three properties are the same as in the previous section, and the
proof of the last property is the same as that for the weakly typical case.

13.9 Strong Conditional Typicality 

Strong conditional typicality bears some similarities to strong typicality, but it is sufficiently
different for us to provide a discussion of it. We first introduce it with a simple example.
Suppose that we draw a sequence from an alphabet $\{0, 1, 2\}$ according to the distribution:

$$p_X(0) = \frac{1}{4}, \quad p_X(1) = \frac{1}{4}, \quad p_X(2) = \frac{1}{2}. \tag{13.127}$$

A particular realization sequence could be as follows:

$$2010201020120212122220202222. \tag{13.128}$$

We count up the occurrences of each symbol and find them to be

$$N(0 \mid 2010201020120212122220202222) = 8, \tag{13.129}$$

$$N(1 \mid 2010201020120212122220202222) = 5, \tag{13.130}$$

$$N(2 \mid 2010201020120212122220202222) = 15. \tag{13.131}$$

The maximum deviation of the sequence's empirical distribution from the true distribution
of the source is as follows:

$$\max\left\{ \left| \frac{1}{4} - \frac{8}{28} \right|, \left| \frac{1}{4} - \frac{5}{28} \right|, \left| \frac{1}{2} - \frac{15}{28} \right| \right\} = \max\left\{ \frac{1}{28}, \frac{2}{28}, \frac{1}{28} \right\} = \frac{1}{14}. \tag{13.132}$$
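The counting above can be reproduced mechanically. The following sketch takes the realization from (13.128) and the distribution values $1/4$, $1/4$, $1/2$ that appear in the deviation computation (13.132). The variable names are our own.

```python
from collections import Counter
from fractions import Fraction

x_seq = "2010201020120212122220202222"     # realization from (13.128)
p_x = {"0": Fraction(1, 4), "1": Fraction(1, 4), "2": Fraction(1, 2)}

counts = Counter(x_seq)                    # N(x | x^n) for each symbol x
n = len(x_seq)
deviations = {x: abs(Fraction(counts[x], n) - p_x[x]) for x in p_x}
max_deviation = max(deviations.values())
print(dict(counts), max_deviation)
```

The sequence is therefore strongly $\delta$-typical for any $\delta \ge 1/14$.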
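We now consider generating a different sequence from an alphabet $\{a, b, c\}$. We generate it,
however, according to the following conditional probability distribution:

$$\begin{bmatrix} p_{Y|X}(a|0) = \frac{1}{5} & p_{Y|X}(a|1) = \frac{1}{6} & p_{Y|X}(a|2) = \frac{2}{4} \\ p_{Y|X}(b|0) = \frac{2}{5} & p_{Y|X}(b|1) = \frac{3}{6} & p_{Y|X}(b|2) = \frac{1}{4} \\ p_{Y|X}(c|0) = \frac{2}{5} & p_{Y|X}(c|1) = \frac{2}{6} & p_{Y|X}(c|2) = \frac{1}{4} \end{bmatrix}. \tag{13.133}$$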



The second generated sequence should thus have correlations with the original sequence. A
possible realization of the second sequence could be as follows:

$$abbcbccabcabcabcabcbcbabacba. \tag{13.134}$$

We would now like to analyze how close the empirical conditional distribution is to the true
conditional distribution for all input and output sequences. A useful conceptual first step is
to apply a permutation to the first sequence so that all of its symbols appear in lexicographic
order, and we then apply the same permutation to the second sequence:

$$\begin{array}{c} 2010201020120212122220202222 \\ abbcbccabcabcabcabcbcbabacba \end{array} \ \xrightarrow{\text{permute}} \ \begin{array}{c} 0000000011111222222222222222 \\ bccaccbbbcabaabbbacbcbcaacba \end{array}. \tag{13.135}$$

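This rearrangement makes it easy to count up the empirical conditional distribution of the
second sequence. We first place the joint occurrences of the symbols into the following
matrix:

$$\begin{bmatrix} N(0,a) = 1 & N(1,a) = 2 & N(2,a) = 5 \\ N(0,b) = 3 & N(1,b) = 2 & N(2,b) = 6 \\ N(0,c) = 4 & N(1,c) = 1 & N(2,c) = 4 \end{bmatrix}, \tag{13.136}$$

and we obtain the empirical conditional distribution matrix by dividing these entries by the
marginal counts $N(x)$ of the first sequence:

$$\begin{bmatrix} \frac{N(0,a)}{N(0)} = \frac{1}{8} & \frac{N(1,a)}{N(1)} = \frac{2}{5} & \frac{N(2,a)}{N(2)} = \frac{5}{15} \\ \frac{N(0,b)}{N(0)} = \frac{3}{8} & \frac{N(1,b)}{N(1)} = \frac{2}{5} & \frac{N(2,b)}{N(2)} = \frac{6}{15} \\ \frac{N(0,c)}{N(0)} = \frac{4}{8} & \frac{N(1,c)}{N(1)} = \frac{1}{5} & \frac{N(2,c)}{N(2)} = \frac{4}{15} \end{bmatrix}. \tag{13.137}$$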



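We then compare the maximal deviation of the elements in this matrix with the elements in
the stochastic matrix in (13.133):

$$\max\left\{ \left| \tfrac{1}{5} - \tfrac{1}{8} \right|, \left| \tfrac{2}{5} - \tfrac{3}{8} \right|, \left| \tfrac{2}{5} - \tfrac{4}{8} \right|, \left| \tfrac{1}{6} - \tfrac{2}{5} \right|, \left| \tfrac{3}{6} - \tfrac{2}{5} \right|, \left| \tfrac{2}{6} - \tfrac{1}{5} \right|, \left| \tfrac{2}{4} - \tfrac{5}{15} \right|, \left| \tfrac{1}{4} - \tfrac{6}{15} \right|, \left| \tfrac{1}{4} - \tfrac{4}{15} \right| \right\}$$

$$= \max\left\{ \tfrac{3}{40}, \tfrac{1}{40}, \tfrac{1}{10}, \tfrac{7}{30}, \tfrac{1}{10}, \tfrac{2}{15}, \tfrac{1}{6}, \tfrac{3}{20}, \tfrac{1}{60} \right\} = \frac{7}{30}. \tag{13.138}$$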
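The same comparison can be carried out without the explicit permutation step by counting symbol pairs directly. In the sketch below the two realizations are those of (13.128) and (13.134), and the conditional probabilities are the ones compared in (13.138). The dictionary layout and names are our own choices.

```python
from collections import Counter
from fractions import Fraction

x_seq = "2010201020120212122220202222"
y_seq = "abbcbccabcabcabcabcbcbabacba"
p_y_given_x = {   # conditional distribution p(y|x), one inner dictionary per input x
    "0": {"a": Fraction(1, 5), "b": Fraction(2, 5), "c": Fraction(2, 5)},
    "1": {"a": Fraction(1, 6), "b": Fraction(3, 6), "c": Fraction(2, 6)},
    "2": {"a": Fraction(2, 4), "b": Fraction(1, 4), "c": Fraction(1, 4)},
}

joint_counts = Counter(zip(x_seq, y_seq))   # N(x, y | x^n, y^n)
x_counts = Counter(x_seq)                   # N(x | x^n)

empirical_conditional = {
    (x, y): Fraction(joint_counts[(x, y)], x_counts[x])
    for x in p_y_given_x for y in p_y_given_x[x]
}
max_deviation = max(
    abs(empirical_conditional[(x, y)] - p_y_given_x[x][y])
    for (x, y) in empirical_conditional
)
print(max_deviation)
```

Counting pairs position by position is equivalent to the permutation argument, because permuting both sequences in the same way does not change the joint counts.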
The above analysis applies to a finite realization to illustrate the notion of conditional
typicality, and there is a large deviation from the true distribution in this case. We would
again expect this deviation to vanish for a random sequence in the limit as the length of the
sequence becomes asymptotically large.

13.9.1 Definition of Strong Conditional Typicality 

We now give a formal definition of strong conditional typicality. 

Definition 13.9.1 (Conditional Empirical Distribution). The conditional empirical
distribution $t_{y^n|x^n}(y|x)$ is as follows:

$$t_{y^n|x^n}(y|x) = \frac{t_{x^n y^n}(x, y)}{t_{x^n}(x)}. \tag{13.139}$$

Definition 13.9.2 (Strong Conditional Typicality). Suppose that a sequence $x^n$ is a strongly
typical sequence in $T_{\delta'}^{X^n}$. Then the $\delta$-strong conditionally typical set $T_\delta^{Y^n|x^n}$ corresponding to
the sequence $x^n$ consists of all sequences whose joint empirical distribution $\frac{1}{n} N(x, y|x^n, y^n)$
is $\delta$-close to the product of the true conditional distribution $p_{Y|X}(y|x)$ with the marginal
empirical distribution $\frac{1}{n} N(x|x^n)$:

$$T_\delta^{Y^n|x^n} \equiv \left\{ y^n : \forall (x, y) \in \mathcal{X} \times \mathcal{Y} \ \ \begin{array}{ll} |N(x, y|x^n, y^n) - p(y|x) N(x|x^n)| \le n\delta & \text{if } p(y|x) > 0 \\ N(x, y|x^n, y^n) = 0 & \text{otherwise} \end{array} \right\}, \tag{13.140}$$

where we abbreviate $p_{Y|X}(y|x)$ as $p(y|x)$.

The above definition of strong conditional typicality implies that the conditional empirical
distribution is close to the true conditional distribution, in the sense that

$$\left| \frac{t_{x^n y^n}(x, y)}{t_{x^n}(x)} - p_{Y|X}(y|x) \right| \le \frac{\delta}{t_{x^n}(x)}. \tag{13.141}$$

Of course, such a relation only makes sense if the marginal empirical distribution $t_{x^n}(x)$ is
non-zero.

The extra technical condition ($N(x, y|x^n, y^n) = 0$ if $p_{Y|X}(y|x) = 0$) in Definition 13.9.2
is present again for a reason that we found in the proof of the Equipartition Property for
Strong Typicality (Property 13.7.4).
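A membership test for the strong conditionally typical set follows directly from the two conditions in the definition. The sketch below is one possible way to write it, under the assumption that the table `p_y_given_x` lists every pair $(x, y)$ of the two alphabets. The function name and the data layout are hypothetical, and the example data are the two realizations from the introductory example of this section.

```python
from collections import Counter
from fractions import Fraction

def is_strong_conditionally_typical(x_seq, y_seq, p_y_given_x, delta):
    """Check |N(x,y) - p(y|x) N(x)| <= n*delta where p(y|x) > 0, and N(x,y) = 0 otherwise."""
    n = len(x_seq)
    joint = Counter(zip(x_seq, y_seq))   # N(x, y | x^n, y^n)
    marginal = Counter(x_seq)            # N(x | x^n)
    for x, row in p_y_given_x.items():
        for y, p in row.items():
            if p > 0:
                if abs(joint[(x, y)] - p * marginal[x]) > n * delta:
                    return False
            elif joint[(x, y)] != 0:
                return False
    return True

x_seq = "2010201020120212122220202222"
y_seq = "abbcbccabcabcabcabcbcbabacba"
p_y_given_x = {
    "0": {"a": Fraction(1, 5), "b": Fraction(2, 5), "c": Fraction(2, 5)},
    "1": {"a": Fraction(1, 6), "b": Fraction(3, 6), "c": Fraction(2, 6)},
    "2": {"a": Fraction(2, 4), "b": Fraction(1, 4), "c": Fraction(1, 4)},
}

typical_loose = is_strong_conditionally_typical(x_seq, y_seq, p_y_given_x, Fraction(1, 10))
typical_tight = is_strong_conditionally_typical(x_seq, y_seq, p_y_given_x, Fraction(1, 20))
print(typical_loose, typical_tight)
```

For these realizations the largest value of $|N(x, y) - p(y|x) N(x)| / n$ is $5/56$, attained at the pair $(2, a)$, so the test passes for $\delta = 1/10$ and fails for $\delta = 1/20$.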



13.9.2 Properties of the Strong Conditionally Typical Set 

The set $T_\delta^{Y^n|x^n}$ of conditionally typical sequences enjoys a few useful properties that are
similar to what we have for the weak conditionally typical set, but the initial sequence $x^n$
can be deterministic. We do, however, impose the constraint that it has to be strongly typical
so that we can prove useful properties for the corresponding strong conditionally typical set.
So first suppose that a given sequence $x^n \in T_{\delta'}^{X^n}$.

Property 13.9.1 (Unit Probability) The set $T_\delta^{Y^n|x^n}$ asymptotically has probability one.
So as $n$ becomes large, it is highly likely that a random sequence $Y^n$ corresponding to a given
typical sequence $x^n$ is a conditionally typical sequence. We formally state this property as
follows:

$$\forall \epsilon > 0 \quad \Pr\{ Y^n \in T_\delta^{Y^n|x^n} \} \ge 1 - \epsilon \quad \text{for sufficiently large } n. \tag{13.142}$$

Property 13.9.2 (Exponentially Small Cardinality) The number $|T_\delta^{Y^n|x^n}|$ of $\delta$-conditionally
typical sequences is exponentially smaller than the total number $|\mathcal{Y}|^n$ of sequences for any
conditional random variable $Y$ that is not uniform. We formally state this property as
follows:

$$|T_\delta^{Y^n|x^n}| \le 2^{n(H(Y|X) + c(\delta + \delta'))}. \tag{13.143}$$

We can also lower bound the size of the $\delta$-conditionally typical set when $n$ is sufficiently
large:

$$\forall \epsilon > 0 \quad |T_\delta^{Y^n|x^n}| \ge (1 - \epsilon)\, 2^{n(H(Y|X) - c(\delta + \delta'))} \quad \text{for sufficiently large } n. \tag{13.144}$$



Property 13.9.3 (Equipartition) The probability of a particular $\delta$-conditionally typical
sequence $y^n$ is approximately uniform:

$$2^{-n(H(Y|X) + c(\delta + \delta'))} \le p_{Y^n|X^n}(y^n|x^n) \le 2^{-n(H(Y|X) - c(\delta + \delta'))}. \tag{13.145}$$

In summary, given a realization $x^n$ of the random variable $X^n$, the conditionally typical
set $T_\delta^{Y^n|x^n}$ has almost all the probability, its size is exponentially smaller than the size of the
set of all sequences, and each $\delta$-conditionally typical sequence has an approximately uniform
probability of occurring.

13.9.3 Proofs of the Properties of the Strong Conditionally Typical Set



Proof of the Unit Probability Property (Property 13.9.1). The proof of this property is
somewhat more complicated for strong conditional typicality. Since we are dealing with an IID
distribution, we can assume that the sequence $x^n$ is lexicographically ordered with an order
on the alphabet $\mathcal{X}$. We write the elements of $\mathcal{X}$ as $x_1, \ldots, x_{|\mathcal{X}|}$. Then the lexicographic
ordering means that we can write the sequence $x^n$ as follows:

$$x^n = x_1 \cdots x_1 \ x_2 \cdots x_2 \ \cdots \ x_{|\mathcal{X}|} \cdots x_{|\mathcal{X}|}. \tag{13.146}$$

It follows that $N(x|x^n) \ge n(p_X(x) - \delta')$ from the typicality of $x^n$, and the law of large
numbers comes into play for each block $x_i \cdots x_i$ with length $N(x_i|x^n)$ when this length is large
enough. Let $p_{Y|X=x}(y)$ be the distribution for the conditional random variable $Y|(X = x)$.
Then the following set is a slightly stronger notion of conditional typicality:

$$\left\{ y^n \in T_\delta^{Y^n|x^n} \right\} \Leftrightarrow \bigwedge_{x \in \mathcal{X}} \left\{ y^{N(x|x^n)} \in T_\delta^{(Y|(X=x))^{N(x|x^n)}} \right\}, \tag{13.147}$$

where the symbol $\wedge$ denotes concatenation (note that the lexicographic ordering of $x^n$ applies
to the ordering of the sequence $y^n$ as well). Also, $T_\delta^{(Y|(X=x))^{N(x|x^n)}}$ is the typical set for a
sequence of conditional random variables $Y|(X = x)$ with length $N(x|x^n)$:

$$T_\delta^{(Y|(X=x))^{N(x|x^n)}} \equiv \left\{ y^{N(x|x^n)} : \forall y \in \mathcal{Y}, \ \left| \frac{N(y|y^{N(x|x^n)})}{N(x|x^n)} - p_{Y|X=x}(y) \right| \le \delta \right\}. \tag{13.148}$$

We can apply the law of large numbers to each of these typical sets $T_\delta^{(Y|(X=x))^{N(x|x^n)}}$ where
the length $N(x|x^n)$ becomes large. It then follows that

$$\Pr\left\{ Y^n \in T_\delta^{Y^n|x^n} \right\} = \prod_{x \in \mathcal{X}} \Pr\left\{ Y^{N(x|x^n)} \in T_\delta^{(Y|(X=x))^{N(x|x^n)}} \right\} \tag{13.149}$$

$$\ge (1 - \epsilon)^{|\mathcal{X}|} \tag{13.150}$$

$$\ge 1 - |\mathcal{X}| \epsilon. \tag{13.151}$$

□



Proof of the Equipartition Property (Property 13.9.3). The following relation holds from the
IID property of the conditional distribution $p_{Y^n|X^n}(y^n|x^n)$ and because the sequence $y^n$ is
strong conditionally typical according to Definition 13.9.2:

$$p_{Y^n|X^n}(y^n|x^n) = \prod_{(x,y) \in (\mathcal{X}, \mathcal{Y})^+} p_{Y|X}(y|x)^{N(x, y|x^n, y^n)}, \tag{13.152}$$

where $(\mathcal{X}, \mathcal{Y})^+$ denotes all the letters $x, y$ in $\mathcal{X}, \mathcal{Y}$ with $p_{Y|X}(y|x) > 0$. Take the logarithm
of the above expression:

$$\log(p_{Y^n|X^n}(y^n|x^n)) = \sum_{x,y \in (\mathcal{X}, \mathcal{Y})^+} N(x, y|x^n, y^n) \log(p_{Y|X}(y|x)). \tag{13.153}$$



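Multiply both sides by $-\frac{1}{n}$:

$$-\frac{1}{n} \log(p_{Y^n|X^n}(y^n|x^n)) = -\sum_{x,y \in (\mathcal{X}, \mathcal{Y})^+} \frac{1}{n} N(x, y|x^n, y^n) \log(p_{Y|X}(y|x)). \tag{13.154}$$

The following relations hold because the sequence $x^n$ is strongly typical and $y^n$ is strong
conditionally typical:

$$\forall x \in \mathcal{X}^+ : \ \left| \frac{1}{n} N(x|x^n) - p_X(x) \right| \le \delta', \tag{13.155}$$

$$\Rightarrow \ \forall x \in \mathcal{X}^+ : \ -\delta' + p_X(x) \le \frac{1}{n} N(x|x^n) \le \delta' + p_X(x), \tag{13.156}$$

$$\forall x, y \in (\mathcal{X}, \mathcal{Y})^+ : \ \left| \frac{1}{n} N(x, y|x^n, y^n) - p_{Y|X}(y|x) \frac{1}{n} N(x|x^n) \right| \le \delta, \tag{13.157}$$

$$\Rightarrow \ \forall x, y \in (\mathcal{X}, \mathcal{Y})^+ : \ -\delta + p_{Y|X}(y|x) \frac{1}{n} N(x|x^n) \le \frac{1}{n} N(x, y|x^n, y^n) \le \delta + p_{Y|X}(y|x) \frac{1}{n} N(x|x^n). \tag{13.158}$$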



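Now multiply (13.158) by $-\log(p_{Y|X}(y|x)) \ge 0$, sum over all letters in the alphabet $(\mathcal{X}, \mathcal{Y})^+$,
and apply the substitution in (13.154). This procedure gives the following set of inequalities:

$$-\sum_{x,y \in (\mathcal{X}, \mathcal{Y})^+} \left( -\delta + p_{Y|X}(y|x) \frac{1}{n} N(x|x^n) \right) \log(p_{Y|X}(y|x)) \le -\frac{1}{n} \log(p_{Y^n|X^n}(y^n|x^n)) \le -\sum_{x,y \in (\mathcal{X}, \mathcal{Y})^+} \left( \delta + p_{Y|X}(y|x) \frac{1}{n} N(x|x^n) \right) \log(p_{Y|X}(y|x)). \tag{13.159}$$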



Now apply the inequalities in (13.156) (assuming that $p_X(x) \ge \delta'$ for $x \in \mathcal{X}^+$) to get that

$$-\sum_{x,y \in (\mathcal{X}, \mathcal{Y})^+} \left( -\delta + p_{Y|X}(y|x)(-\delta' + p_X(x)) \right) \log(p_{Y|X}(y|x)) \tag{13.160}$$

$$\le -\frac{1}{n} \log(p_{Y^n|X^n}(y^n|x^n)) \tag{13.161}$$

$$\le -\sum_{x,y \in (\mathcal{X}, \mathcal{Y})^+} \left( \delta + p_{Y|X}(y|x)(\delta' + p_X(x)) \right) \log(p_{Y|X}(y|x)), \tag{13.162}$$

$$\Rightarrow \ -c(\delta + \delta') + H(Y|X) \le -\frac{1}{n} \log(p_{Y^n|X^n}(y^n|x^n)) \le c(\delta + \delta') + H(Y|X), \tag{13.163}$$

$$\Rightarrow \ 2^{-n(H(Y|X) + c(\delta + \delta'))} \le p_{Y^n|X^n}(y^n|x^n) \le 2^{-n(H(Y|X) - c(\delta + \delta'))}, \tag{13.164}$$

where

$$c \equiv -\sum_{x,y \in (\mathcal{X}, \mathcal{Y})^+} \log(p_{Y|X}(y|x)) \ge 0. \tag{13.165}$$

It again becomes apparent why we require the technical condition in the definition of strong
conditional typicality (Definition 13.9.2). Were it not there, then the constant $c$ would not
be finite, and we would not be able to obtain a reasonable bound on the probability of a
strong conditionally typical sequence. □

We close this section with a lemma that relates strong conditional, marginal, and joint
typicality.

Lemma 13.9.1. Suppose that $y^n$ is a conditionally typical sequence in $T_\delta^{Y^n|x^n}$ and its
conditioning sequence $x^n$ is a typical sequence in $T_{\delta'}^{X^n}$. Then $x^n$ and $y^n$ are jointly typical in the
set $T_{\delta + \delta'}^{X^n Y^n}$, and $y^n$ is a typical sequence in $T_{|\mathcal{X}|(\delta + \delta')}^{Y^n}$.

Proof. It follows from the above that $\forall x \in \mathcal{X}, y \in \mathcal{Y}$:

$$p_X(x) - \delta' \le \frac{1}{n} N(x|x^n) \le \delta' + p_X(x), \tag{13.166}$$

$$p_{Y|X}(y|x) \frac{1}{n} N(x|x^n) - \delta \le \frac{1}{n} N(x, y|x^n y^n) \le \delta + p_{Y|X}(y|x) \frac{1}{n} N(x|x^n). \tag{13.167}$$

Substituting the upper bound on $\frac{1}{n} N(x|x^n)$ gives

$$\frac{1}{n} N(x, y|x^n y^n) \le \delta + p_{Y|X}(y|x)(\delta' + p_X(x)) \tag{13.168}$$

$$= \delta + p_{Y|X}(y|x)\delta' + p_X(x) p_{Y|X}(y|x) \tag{13.169}$$

$$\le \delta + \delta' + p_{X,Y}(x, y). \tag{13.170}$$

Similarly, substituting the lower bound on $\frac{1}{n} N(x|x^n)$ gives

$$\frac{1}{n} N(x, y|x^n y^n) \ge p_{X,Y}(x, y) - \delta - \delta'. \tag{13.171}$$

Putting both of the above bounds together, we get the following bound:

$$\left| \frac{1}{n} N(x, y|x^n y^n) - p_{X,Y}(x, y) \right| \le \delta + \delta'. \tag{13.172}$$

This then implies that the sequences $x^n$ and $y^n$ lie in the strong jointly typical set $T_{\delta + \delta'}^{X^n Y^n}$.
It follows from the result of Exercise 13.8.1 that $y^n \in T_{|\mathcal{X}|(\delta + \delta')}^{Y^n}$. □
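The joint typicality bound in the lemma can be checked numerically on the example of Section 13.9. In the sketch below the two realizations are those of (13.128) and (13.134), $\delta'$ and $\delta$ are taken to be the smallest values for which $x^n$ is strongly typical and $y^n$ is strong conditionally typical, and the joint distribution is formed as $p_X(x)\, p_{Y|X}(y|x)$. The variable names are our own.

```python
from collections import Counter
from fractions import Fraction

x_seq = "2010201020120212122220202222"
y_seq = "abbcbccabcabcabcabcbcbabacba"
p_x = {"0": Fraction(1, 4), "1": Fraction(1, 4), "2": Fraction(1, 2)}
p_y_given_x = {
    "0": {"a": Fraction(1, 5), "b": Fraction(2, 5), "c": Fraction(2, 5)},
    "1": {"a": Fraction(1, 6), "b": Fraction(3, 6), "c": Fraction(2, 6)},
    "2": {"a": Fraction(2, 4), "b": Fraction(1, 4), "c": Fraction(1, 4)},
}

n = len(x_seq)
joint = Counter(zip(x_seq, y_seq))
marginal = Counter(x_seq)
pairs = [(x, y) for x in p_y_given_x for y in p_y_given_x[x]]

# Smallest delta' for which x^n is strongly typical.
delta_prime = max(abs(Fraction(marginal[x], n) - p_x[x]) for x in p_x)
# Smallest delta for which y^n is strong conditionally typical given x^n.
delta = max(abs(joint[(x, y)] - p_y_given_x[x][y] * marginal[x]) / n for x, y in pairs)
# Deviation of the joint empirical distribution from p(x) p(y|x).
joint_deviation = max(
    abs(Fraction(joint[(x, y)], n) - p_x[x] * p_y_given_x[x][y]) for x, y in pairs
)
print(delta_prime, delta, joint_deviation, joint_deviation <= delta + delta_prime)
```

Here the joint deviation is well inside the bound $\delta + \delta'$, which reflects that the lemma gives a worst-case guarantee.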



13.10 Application: Shannon's Channel Capacity Theorem

We close the technical content of this chapter with a remarkable application of conditional
typicality: Shannon's channel capacity theorem. As discussed in Section 2.2.3, this theorem
is one of the central results of classical information theory, appearing in Shannon's seminal
paper. The theorem establishes that the highest achievable rate for communication over
many independent uses of a classical channel is equal to a simple function of the channel.

We begin by defining the information processing task and a corresponding $(n, R, \epsilon)$ channel
code. It is helpful to recall Figure 2.4 depicting a general protocol for communication
over a classical channel $\mathcal{N} \equiv p_{Y|X}(y|x)$. Before communication begins, the sender Alice and
receiver Bob have already established a codebook $\{x^n(m)\}_{m \in \mathcal{M}}$, where each codeword $x^n(m)$
corresponds to a message $m$ that Alice might wish to send to Bob. If Alice wishes to send
message $m$, she inputs the codeword $x^n(m)$ to the IID channel $\mathcal{N}^n \equiv p_{Y^n|X^n}(y^n|x^n)$. More
formally, her map is some encoding $E^n : \mathcal{M} \to \mathcal{X}^n$. She then exploits $n$ uses of the channel to
send $x^n(m)$. Bob receives some sequence $y^n$ from the output of the channel, and he performs
a decoding $D^n : \mathcal{Y}^n \to \mathcal{M}$ in order to recover the message $m$ that Alice transmits. The rate
$R$ of the code is equal to $\log_2 |\mathcal{M}| / n$, measured in bits per channel use. The probability of
error $p_e$ for an $(n, R, \epsilon)$ channel code is bounded from above as

$$p_e \equiv \max_m \Pr\{ D^n(\mathcal{N}^n(E^n(m))) \ne m \} \le \epsilon. \tag{13.173}$$

A communication rate $R$ is achievable if there exists an $(n, R - \delta, \epsilon)$ channel code for all
$\epsilon, \delta > 0$ and sufficiently large $n$. The channel capacity $C(\mathcal{N})$ of $\mathcal{N}$ is the supremum of all
achievable rates. We can now state Shannon's channel capacity theorem:

Theorem 13.10.1 (Shannon Channel Capacity). The maximum mutual information $I(\mathcal{N})$
is equal to the capacity $C(\mathcal{N})$ of a channel $\mathcal{N} \equiv p_{Y|X}(y|x)$:

$$C(\mathcal{N}) = I(\mathcal{N}) \equiv \max_{p_X(x)} I(X; Y). \tag{13.174}$$
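To make the capacity formula concrete, the following sketch evaluates $\max_{p_X(x)} I(X; Y)$ for a binary symmetric channel by a coarse grid search over input distributions. The channel, its crossover probability, and the grid resolution are assumptions chosen for illustration. A grid search is not an exact optimizer, and for larger alphabets an iterative method would be more appropriate.

```python
from math import log2

def mutual_information(p_x, channel):
    """I(X;Y) in bits for input distribution p_x and channel[x][y] = p(y|x)."""
    num_outputs = len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x)))
           for y in range(num_outputs)]
    total = 0.0
    for x, px in enumerate(p_x):
        for y in range(num_outputs):
            joint = px * channel[x][y]
            if joint > 0:
                total += joint * log2(channel[x][y] / p_y[y])
    return total

def binary_entropy(p):
    """H_2(p) in bits."""
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

flip = 0.1                                   # hypothetical crossover probability
bsc = [[1 - flip, flip], [flip, 1 - flip]]   # binary symmetric channel

# Coarse grid search over binary input distributions (p0, 1 - p0).
grid = [i / 1000 for i in range(1001)]
best_p0 = max(grid, key=lambda p0: mutual_information([p0, 1 - p0], bsc))
capacity_estimate = mutual_information([best_p0, 1 - best_p0], bsc)
print(best_p0, capacity_estimate, 1 - binary_entropy(flip))
```

For the binary symmetric channel the maximum is attained at the uniform input, and the estimate agrees with the well-known closed form $1 - H_2(\text{flip})$.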

Proof. The proof consists of two parts. The first part, known as the direct coding theorem,
demonstrates that RHS $\le$ LHS. That is, there exists a sequence of channel codes
demonstrating that the rate $I(\mathcal{N})$ is achievable. The second part, known as the converse part,
demonstrates that LHS $\le$ RHS. That is, it demonstrates that the rate on the RHS is optimal, and it is
impossible to have achievable rates exceeding it. Here, we prove the direct coding theorem
and hold off on proving the converse part until we reach the HSW theorem in Chapter 19
because the converse theorem there suffices as the converse part for this classical theorem.

We have already outlined the proof of the direct coding theorem in Section 2.2.4, and it might
be helpful at this point to review that section. In particular, the proof breaks down into three
parts: random coding to establish the encoding, the decoding algorithm for the receiver, and
the error analysis. We now give all of the technical details of the proof because this chapter
has established all the tools that we need.

Code Construction. Before communication begins, Alice and Bob agree upon a code by the
following random selection procedure. For every message $m \in \mathcal{M}$, generate a codeword $x^n(m)$
IID according to the product distribution $p_{X^n}(x^n)$, where $p_X(x)$ is the distribution that
maximizes $I(\mathcal{N})$.

Encoding. If Alice wishes to send message $m$, she inputs the codeword $x^n(m)$ to the channels.

Decoding Algorithm. After receiving the sequence $y^n$ from the channel outputs, Bob tests
whether $y^n$ is in the typical set $T_\delta^{Y^n}$ corresponding to the distribution
$p_Y(y) \equiv \sum_x p_{Y|X}(y|x) p_X(x)$. If it is not, then he reports an error. He then tests if there is
some message $m$ such that the sequence $y^n$ is in the conditionally typical set $T_\delta^{Y^n|x^n(m)}$. If $m$ is
the unique message such that $y^n \in T_\delta^{Y^n|x^n(m)}$, then he declares $m$ to be the transmitted
message. If there is no message $m$ such that $y^n \in T_\delta^{Y^n|x^n(m)}$ or multiple messages $m'$ such that
$y^n \in T_\delta^{Y^n|x^n(m')}$, then he reports an error. Observe that the decoder is a function of the
channel, so that we might say that we construct channel codes "from the channel."

Error Analysis. As discussed in the above decoding algorithm, there are three kinds of errors
that can occur in this communication scheme when Alice sends the codeword $x^n(m)$ over the
channels:

$\mathcal{E}_0(m)$: The event that the channel output $y^n$ is not in the typical set $T_\delta^{Y^n}$.

$\mathcal{E}_1(m)$: The event that the channel output $y^n$ is in $T_\delta^{Y^n}$ but not in the conditionally typical
set $T_\delta^{Y^n|x^n(m)}$.

$\mathcal{E}_2(m)$: The event that the channel output $y^n$ is in $T_\delta^{Y^n}$ but it is in the conditionally typical
set for some other message:

$$\left\{ y^n \in T_\delta^{Y^n} \right\} \text{ and } \left\{ \exists m' \ne m : y^n \in T_\delta^{Y^n|x^n(m')} \right\}. \tag{13.175}$$
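The decoding rule and its three error events can be sketched in a few lines. The following is a toy illustration only: the channel (a binary symmetric channel with crossover probability $1/4$), the two-codeword codebook, and all function names are hypothetical, and the codebook is fixed by hand rather than drawn at random as in the proof.

```python
from collections import Counter

def strongly_typical(seq, p, delta):
    """Strong typicality of seq with respect to the distribution p."""
    n, counts = len(seq), Counter(seq)
    if any(symbol not in p for symbol in counts):
        return False
    return all(
        abs(counts[a] / n - pa) <= delta if pa > 0 else counts[a] == 0
        for a, pa in p.items()
    )

def conditionally_typical(x_seq, y_seq, p_y_given_x, delta):
    """Strong conditional typicality of y_seq given the codeword x_seq."""
    n = len(x_seq)
    joint, marginal = Counter(zip(x_seq, y_seq)), Counter(x_seq)
    for x, row in p_y_given_x.items():
        for y, p in row.items():
            if p > 0:
                if abs(joint[(x, y)] - p * marginal[x]) > n * delta:
                    return False
            elif joint[(x, y)] != 0:
                return False
    return True

def decode(y_seq, codebook, p_y, p_y_given_x, delta):
    """Return the decoded message index, or None to report an error."""
    if not strongly_typical(y_seq, p_y, delta):
        return None                      # output not typical (first kind of error)
    candidates = [m for m, x_seq in enumerate(codebook)
                  if conditionally_typical(x_seq, y_seq, p_y_given_x, delta)]
    # Exactly one candidate: decode it. None or several candidates: report an error.
    return candidates[0] if len(candidates) == 1 else None

# Hypothetical toy instance: binary symmetric channel with crossover 1/4.
p_y_given_x = {"0": {"0": 0.75, "1": 0.25}, "1": {"0": 0.25, "1": 0.75}}
p_y = {"0": 0.5, "1": 0.5}               # output marginal for a uniform input
codebook = ["00001111", "01010101"]
received = "10000111"                    # codeword 0 with two flipped bits
print(decode(received, codebook, p_y, p_y_given_x, delta=0.1))
```

With a looser $\delta$ both codewords pass the conditional typicality test for this received sequence, and the decoder reports an error. This is the kind of ambiguity that the third error event captures.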

For each of the above events, we can exploit indicator functions in order to simplify the
error analysis (we are also doing this to help build a bridge between this classical proof and
the packing lemma approach for the quantum case in Chapter 15; projectors in some sense
replace indicator functions later on). Recall that an indicator function $I_A(x)$ is equal to one
if $x \in A$ and equal to zero otherwise. So each of the following three functions taking a value
of one or larger corresponds to the error events $\mathcal{E}_0(m)$, $\mathcal{E}_1(m)$, and $\mathcal{E}_2(m)$, respectively:

$$1 - I_{T_\delta^{Y^n}}(y^n), \tag{13.176}$$

$$I_{T_\delta^{Y^n}}(y^n) \left( 1 - I_{T_\delta^{Y^n|x^n(m)}}(y^n) \right), \tag{13.177}$$

$$\sum_{m' \ne m} I_{T_\delta^{Y^n}}(y^n)\, I_{T_\delta^{Y^n|x^n(m')}}(y^n). \tag{13.178}$$



Recall from Section 2.2.4 that it is helpful to analyze the expectation of the average error
probability, where the expectation is with respect to the random selection of the code and
the average is with respect to a uniformly random choice of the message $m$. That is, we
analyze

$$\mathbb{E}_{X^n} \left\{ \frac{1}{|\mathcal{M}|} \sum_m \Pr\{ \mathcal{E}_0(m) \cup \mathcal{E}_1(m) \cup \mathcal{E}_2(m) \} \right\}. \tag{13.179}$$

(The notation $\mathbb{E}_{X^n}$ implicitly indicates an expectation over all codewords.) Our first "move"
is to exchange the expectation and the sum:

$$\frac{1}{|\mathcal{M}|} \sum_m \mathbb{E}_{X^n(m)} \{ \Pr\{ \mathcal{E}_0(m) \cup \mathcal{E}_1(m) \cup \mathcal{E}_2(m) \} \}. \tag{13.180}$$

Since all codewords are selected in the same way (randomly and independently of the message
$m$), it suffices to analyze $\mathbb{E}_{X^n(m)} \{ \Pr\{ \mathcal{E}_0(m) \cup \mathcal{E}_1(m) \cup \mathcal{E}_2(m) \} \}$ for just a single message $m$.
So we can then apply the union bound:

$$\mathbb{E}_{X^n(m)} \{ \Pr\{ \mathcal{E}_0(m) \cup \mathcal{E}_1(m) \cup \mathcal{E}_2(m) \} \} \le \mathbb{E}_{X^n(m)} \{ \Pr\{ \mathcal{E}_0(m) \} \} + \mathbb{E}_{X^n(m)} \{ \Pr\{ \mathcal{E}_1(m) \} \} + \mathbb{E}_{X^n(m)} \{ \Pr\{ \mathcal{E}_2(m) \} \}. \tag{13.181}$$



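We now analyze each error individually. By exploiting the indicator function from (13.176),
we have that

$$\mathbb{E}_{X^n(m)} \{ \Pr\{ \mathcal{E}_0(m) \} \} \tag{13.182}$$

$$= \mathbb{E}_{X^n(m)} \left\{ \mathbb{E}_{Y^n|X^n(m)} \left\{ 1 - I_{T_\delta^{Y^n}}(Y^n) \right\} \right\} \tag{13.183}$$

$$= 1 - \mathbb{E}_{X^n(m), Y^n} \left\{ I_{T_\delta^{Y^n}}(Y^n) \right\} \tag{13.184}$$

$$= 1 - \mathbb{E}_{Y^n} \left\{ I_{T_\delta^{Y^n}}(Y^n) \right\} \tag{13.185}$$

$$= \Pr\left\{ Y^n \notin T_\delta^{Y^n} \right\} \le \epsilon, \tag{13.186}$$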

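where in the last line we have exploited the high probability property of the typical set
$T_\delta^{Y^n}$. In the above, we are also exploiting the fact that $\mathbb{E}\{ I_A \} = \Pr\{ A \}$. By exploiting the
indicator function from (13.177), we have that

$$\mathbb{E}_{X^n(m)} \{ \Pr\{ \mathcal{E}_1(m) \} \} \tag{13.187}$$

$$= \mathbb{E}_{X^n(m)} \left\{ \mathbb{E}_{Y^n|X^n(m)} \left\{ I_{T_\delta^{Y^n}}(Y^n) \left( 1 - I_{T_\delta^{Y^n|X^n(m)}}(Y^n) \right) \right\} \right\} \tag{13.188}$$

$$\le \mathbb{E}_{X^n(m)} \left\{ \mathbb{E}_{Y^n|X^n(m)} \left\{ 1 - I_{T_\delta^{Y^n|X^n(m)}}(Y^n) \right\} \right\} \tag{13.189}$$

$$= 1 - \mathbb{E}_{X^n(m)} \left\{ \mathbb{E}_{Y^n|X^n(m)} \left\{ I_{T_\delta^{Y^n|X^n(m)}}(Y^n) \right\} \right\} \tag{13.190}$$

$$= \mathbb{E}_{X^n(m)} \left\{ \Pr_{Y^n|X^n(m)} \left\{ Y^n \notin T_\delta^{Y^n|X^n(m)} \right\} \right\} \le \epsilon, \tag{13.191}$$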

where in the last line we have exploited the high probability property of the conditionally
typical set $T_\delta^{Y^n|X^n(m)}$. We finally consider the probability of the last kind of error by
exploiting the indicator function in (13.178):

\begin{align}
& \mathbb{E}_{X^n(m)}\{\Pr\{\mathcal{E}_2(m)\}\} \\
& \quad \leq \sum_{m' \neq m} \mathbb{E}_{X^n(m),X^n(m'),Y^n}\left\{I_{T_\delta^{Y^n}}(Y^n)\, I_{T_\delta^{Y^n|X^n(m')}}(Y^n)\right\} \tag{13.192} \\
& \quad = \sum_{m' \neq m}\ \sum_{x^n(m),x^n(m'),y^n} p_{X^n}(x^n(m))\, p_{X^n}(x^n(m'))\, p_{Y^n|X^n}(y^n|x^n(m))\, I_{T_\delta^{Y^n}}(y^n)\, I_{T_\delta^{Y^n|x^n(m')}}(y^n) \tag{13.193} \\
& \quad = \sum_{m' \neq m}\ \sum_{x^n(m'),y^n} p_{X^n}(x^n(m'))\, p_{Y^n}(y^n)\, I_{T_\delta^{Y^n}}(y^n)\, I_{T_\delta^{Y^n|x^n(m')}}(y^n). \tag{13.194}
\end{align}

The first inequality is from the union bound, and the first equality follows from the way
that we select the random code: for every message $m$, the codewords are selected independently
and randomly according to $p_{X^n}$ so that the distribution for the joint random variable
$X^n(m) X^n(m') Y^n$ is

$$p_{X^n}(x^n(m))\, p_{X^n}(x^n(m'))\, p_{Y^n|X^n}(y^n|x^n(m)). \tag{13.195}$$

The second equality follows from marginalizing over $X^n(m)$. Continuing, we have

\begin{align}
& \leq 2^{-n[H(Y)-\delta]} \sum_{m' \neq m}\ \sum_{x^n(m'),y^n} p_{X^n}(x^n(m'))\, I_{T_\delta^{Y^n|x^n(m')}}(y^n) \tag{13.196} \\
& = 2^{-n[H(Y)-\delta]} \sum_{m' \neq m}\ \sum_{x^n(m')} p_{X^n}(x^n(m')) \sum_{y^n} I_{T_\delta^{Y^n|x^n(m')}}(y^n) \tag{13.197} \\
& \leq 2^{-n[H(Y)-\delta]}\, 2^{n[H(Y|X)+\delta]} \sum_{m' \neq m}\ \sum_{x^n(m')} p_{X^n}(x^n(m')) \tag{13.198} \\
& \leq |\mathcal{M}|\, 2^{-n[I(X;Y)-2\delta]}. \tag{13.199}
\end{align}

The first inequality follows from the bound $p_{Y^n}(y^n)\, I_{T_\delta^{Y^n}}(y^n) \leq 2^{-n[H(Y)-\delta]}$ that holds for
typical sequences. The second inequality follows from the cardinality bound
$\left|T_\delta^{Y^n|x^n(m')}\right| \leq 2^{n[H(Y|X)+\delta]}$ on the conditionally typical set. The last inequality follows because

$$\sum_{x^n(m')} p_{X^n}(x^n(m')) = 1, \tag{13.200}$$

$|\mathcal{M}|$ is an upper bound on $\sum_{m' \neq m} 1$, and by the identity $I(X;Y) = H(Y) - H(Y|X)$. Thus,
we can make this error arbitrarily small by choosing the message set size $|\mathcal{M}| = 2^{n[I(X;Y)-3\delta]}$.



Putting everything together, we have the following bound on (13.179):

$$\epsilon' \equiv 2\epsilon + 2^{-n\delta}, \tag{13.201}$$

as long as we choose the message set size as given above. It follows that there exists a 
particular code with the same error bound on its average error probability. We can then 
exploit an expurgation argument as discussed in Section 2.2.4 to convert an average error
bound into a maximal one. Thus, we have shown the achievability of an $(n, C(\mathcal{N}) - \delta', \epsilon')$
channel code for all $\delta', \epsilon' > 0$ and sufficiently large $n$. Finally, as a simple observation, our
proof above does not rely on whether the definition of conditional typicality employed is
weak or strong. □
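The arithmetic behind the final bound can be checked numerically. The following is a minimal sketch, assuming a binary symmetric channel with a uniform input (so that $I(X;Y) = 1 - h(p)$); the channel, the crossover probability, and the function names are illustrative assumptions and are not part of the proof.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy h(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def third_error_bound(n: int, mutual_info: float, delta: float) -> float:
    """Evaluate |M| * 2^{-n[I(X;Y) - 2 delta]} for |M| = 2^{n[I(X;Y) - 3 delta]}."""
    log2_num_messages = n * (mutual_info - 3 * delta)
    log2_bound = log2_num_messages - n * (mutual_info - 2 * delta)
    return 2.0 ** log2_bound  # algebraically equal to 2^{-n delta}


def total_error_bound(n: int, epsilon: float, delta: float) -> float:
    """The overall bound epsilon' = 2 epsilon + 2^{-n delta} from (13.201)."""
    return 2 * epsilon + 2.0 ** (-n * delta)


# Hypothetical example: crossover probability 0.1 with a uniform input.
mutual_info = 1.0 - binary_entropy(0.1)
print(third_error_bound(1000, mutual_info, 0.01))
print(total_error_bound(1000, 0.01, 0.01))
```

The exponent of the third error term does not depend on the value of $I(X;Y)$ once the message set size is fixed at $2^{n[I(X;Y)-3\delta]}$; only the gap $\delta$ between the rate and the mutual information matters.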

13.11 Concluding Remarks 

This chapter deals with many different definitions and flavors of typicality in the classical 
world, but the essential theme is Shannon's central insight — the application of the law of 
large numbers in information theory. Our main goal in information theory is to analyze the 
probability of error in the transmission or compression of information. Thus, we deal with 
probabilities and we do not care much what happens for all sequences, but we instead only 
care what happens for the likely sequences. This frame of mind immediately leads to the 
definition of a typical sequence and to a simple scheme for the compression of information — 
keep only the typical sequences and performance is optimal in the asymptotic limit. Despite 
the seemingly different nature of quantum information when compared to its classical coun- 
terpart, the intuition developed in this chapter carries over to the quantum world in the next 
chapter where we define several different notions of quantum typicality. 

13.12 History and Further Reading 

The book of Cover and Thomas contains a great presentation of typicality in the classical
case [57]. The proof of Property 13.7.5 is directly from the Cover and Thomas book. Berger
introduced strong typicality [37], and Csiszár and Körner systematically developed it [58].
Other useful books on information theory are that of Berger [36] and Yeung [273]. There
are other notions of typicality which are useful, including those presented in Ref. [89] and
Ref. [264]. Our proof of Shannon's channel capacity theorem is similar to that in Savov's
thesis |2TT] . 
Quantum Typicality



This chapter marks the beginning of our study of the asymptotic theory of quantum
information, where we develop the technical tools underpinning this theory. The intuition for it
is similar to the intuition we developed in the previous chapter on typical sequences, but we
will find some important differences between the classical and quantum cases.

So far, there is not a single known information processing task in quantum Shannon 
theory where the tools from this chapter are not helpful in proving the achievability part of 
a coding theorem. For the most part, we can straightforwardly import many of the ideas from 
the previous chapter about typical sequences for use in the asymptotic theory of quantum 
information. Though, one might initially think that there are some obstacles to doing so. For 
example, what is the analogy of a quantum information source? Once we have established 
this notion, how would we determine if a state emitted from a quantum information source 
is a typical state? In the classical case, a simple way of determining typicality is to inspect 
all of the bits in the sequence. But there is a problem with this approach in the quantum 
domain — "looking at quantum bits" is equivalent to performing a measurement and doing so 
destroys delicate superpositions that we would want to preserve in any subsequent quantum 
information processing task. 

So how can we get around the aforementioned problem and construct a useful notion of 
quantum typicality? Well, we should not be so destructive in determining the answer to a 
question when it has only two possible answers. After all, we are only asking "Is the state 
typical or not?" , and we can be a bit more delicate in the way that we ask this question. As 
an analogy, suppose Bob is curious to determine whether Alice could join him for dinner at a 
nice restaurant on the beach. He would likely just phone her and politely ask, "Sweet Alice, 
are you available for a lovely oceanside dinner?", as opposed to barging into her apartment, 
probing through all of her belongings in search of her calendar, and demanding that she join 
him if she is available. This latter infraction would likely disturb her so much that she would 
never speak to him again (and what would become of quantum Shannon theory without 
these two communicating!). It is the same with quantum information — we must be gentle 
when handling quantum states. Otherwise, we will disturb the state so much that it will not 
be useful in any future quantum information processing task. 

We can gently ask a binary question to a quantum system by constructing an incomplete
measurement with only two outcomes. If one outcome has a high probability of occurring, 
then we do not learn much about the state after learning this outcome, and thus we would 
expect that this inquiry does not disturb the state very much. For the case above, we can 
formulate the question, "Is the state typical or not?" as a binary measurement that returns 
only the answer to this question and no more information. Since it is highly likely that the 
state is indeed a typical state, we would expect this inquiry not to disturb the state very 
much, and we could use it for further quantum information processing tasks. This is the 
essential content of this chapter, and there are several technicalities necessary to provide a 
rigorous underpinning. 

We structure this chapter as follows. We first discuss the notion of a typical subspace (the
quantum analog of the typical set). We can employ weak or strong notions of typicality
in the definition of quantum typicality. Section 14.2 then discusses conditional quantum
typicality, a form of quantum typicality that applies to quantum states chosen randomly 
according to a classical sequence. We end this chapter with a brief discussion of the method 
of types for quantum systems. All of these developments are important for understanding 
the asymptotic nature of quantum information and for determining the ultimate limits of 
storage and transmission with quantum media. 

14.1 The Typical Subspace 

Our first task is to establish the notion of a quantum information source. It is analogous to 
the notion of a classical information source, in the sense that the source randomly outputs 
a quantum state according to some probability distribution, but the states that it outputs 
do not necessarily have to be distinguishable as in the classical case. 

Definition 14.1.1 (Quantum Information Source). A quantum information source is some
device that randomly emits pure qudit states in a Hilbert space $\mathcal{H}_X$ of size $|\mathcal{X}|$.

We use the symbol $X$ to denote the quantum system for the quantum information source
in addition to denoting the Hilbert space in which the state lives. Suppose that the source
outputs states $|\psi_y\rangle$ randomly according to some probability distribution $p_Y(y)$. Note that
the states $|\psi_y\rangle$ do not necessarily have to form an orthonormal set. Then the density operator
$\rho^X$ of the source is the expected state emitted:

$$\rho^X \equiv \mathbb{E}_Y\{|\psi_Y\rangle\langle\psi_Y|\} = \sum_y p_Y(y)\, |\psi_y\rangle\langle\psi_y|. \tag{14.1}$$

There are many decompositions of a density operator as a convex sum of rank-one projectors
(and the above decomposition is one such example), but perhaps the most important
decomposition is a spectral decomposition of the density operator $\rho$:

$$\rho^X = \sum_{x \in \mathcal{X}} p_X(x)\, |x\rangle\langle x|^X. \tag{14.2}$$

The above states $|x\rangle^X$ are eigenvectors of $\rho^X$ and form a complete orthonormal basis for
Hilbert space $\mathcal{H}_X$, and the non-negative, convex real numbers $p_X(x)$ are the eigenvalues
of $\rho^X$.

We have written the states $|x\rangle^X$ and the eigenvalues $p_X(x)$ in a suggestive notation
because it is actually possible to think of our quantum source as a classical information
source: the emitted states $\{|x\rangle^X\}_{x \in \mathcal{X}}$ are orthonormal and each corresponding eigenvalue
$p_X(x)$ acts as a probability for choosing $|x\rangle^X$. We can say that our source is classical because
it is emitting the orthogonal, and thus distinguishable, states $|x\rangle^X$ with probability $p_X(x)$.
This description is equivalent to the ensemble $\{p_Y(y), |\psi_y\rangle\}_y$ because the two ensembles lead
to the same density operator (recall that two ensembles that have the same density operator
are essentially equivalent because they lead to the same probabilities for outcomes of any
measurement performed on the system). Our quantum information source then corresponds
to the pure-state ensemble:

$$\left\{p_X(x), |x\rangle^X\right\}_{x \in \mathcal{X}}. \tag{14.3}$$

Recall that the von Neumann entropy $H(X)$ of the density operator $\rho^X$ is as follows
(Definition 11.1.1):

$$H(X)_\rho \equiv -\mathrm{Tr}\{\rho^X \log \rho^X\}. \tag{14.4}$$

It is straightforward to show that the von Neumann entropy $H(X)_\rho$ is equal to the Shannon
entropy $H(X)$ of a random variable $X$ with distribution $p_X(x)$ because the basis states $|x\rangle^X$
are orthonormal.

Suppose now that the quantum information source emits a large number $n$ of random
quantum states so that the density operator describing the emitted state is as follows:

$$\rho^{X^n} \equiv \underbrace{\rho^{X_1} \otimes \cdots \otimes \rho^{X_n}}_{n \text{ times}} = \left(\rho^X\right)^{\otimes n}. \tag{14.5}$$

The labels $X_1, \ldots, X_n$ denote the Hilbert spaces in which the different quantum systems
live, but the density operator is the same for each Hilbert space $X_1, \ldots, X_n$ and is equal to
$\rho^X$. The above description of a quantum source is within the independent and identically
distributed (IID) setting for the quantum domain. The spectral decomposition of the state
in (14.5) is as follows:

\begin{align}
\rho^{X^n} & = \sum_{x_1 \in \mathcal{X}} p_X(x_1)\, |x_1\rangle\langle x_1|^{X_1} \otimes \cdots \otimes \sum_{x_n \in \mathcal{X}} p_X(x_n)\, |x_n\rangle\langle x_n|^{X_n} \tag{14.6} \\
& = \sum_{x_1, \ldots, x_n \in \mathcal{X}} p_X(x_1) \cdots p_X(x_n)\, |x_1\rangle \cdots |x_n\rangle \langle x_1| \cdots \langle x_n|^{X_1, \ldots, X_n} \tag{14.7} \\
& = \sum_{x^n \in \mathcal{X}^n} p_{X^n}(x^n)\, |x^n\rangle\langle x^n|^{X^n}, \tag{14.8}
\end{align}

where we employ the shorthand:

$$p_{X^n}(x^n) \equiv p_X(x_1) \cdots p_X(x_n), \qquad |x^n\rangle^{X^n} \equiv |x_1\rangle^{X_1} \cdots |x_n\rangle^{X_n}. \tag{14.9}$$

The above quantum description of the density operator is essentially equivalent to the
classical picture of $n$ realizations of random variable $X$ with each eigenvalue $p_X(x_1) \cdots p_X(x_n)$
acting as a probability because the set of states $\{|x_1\rangle \cdots |x_n\rangle^{X_1, \ldots, X_n}\}_{x_1, \ldots, x_n \in \mathcal{X}}$ is an
orthonormal set.
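The product structure of the spectrum in (14.8) and (14.9) is easy to see for a small number of copies. The sketch below is only an illustration: the helper name `iid_spectrum` and the example eigenvalues are assumptions, and classical sequences are represented as tuples of indices into the eigenvalue list.

```python
import math
from itertools import product


def iid_spectrum(eigenvalues, n):
    """Map each classical sequence x^n to the eigenvalue p_X(x_1) ... p_X(x_n)."""
    return {
        seq: math.prod(eigenvalues[x] for x in seq)
        for seq in product(range(len(eigenvalues)), repeat=n)
    }


# Hypothetical qubit source with eigenvalues (3/4, 1/4) and n = 3 copies.
spectrum = iid_spectrum([0.75, 0.25], 3)
print(len(spectrum))             # number of basis states |x^n>
print(spectrum[(0, 0, 1)])       # eigenvalue for the sequence 001
print(sum(spectrum.values()))    # the eigenvalues form a probability distribution
```

The eigenvalues of the $n$-copy state are exactly the IID probabilities of the classical sequences, which is why the classical definitions of typicality can be reused.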

We can now "quantize" or extend the notion of typicality to the quantum information
source. The definitions follow directly from the classical definitions in Chapter 13. The
quantum definition of typicality can employ either the weak notion as in Definition 13.2.3
or the strong notion as in Definition 13.7.2. We do not distinguish the notation for a typical
subspace and a typical set because it should be clear from context which kind of typicality
we are employing.

Definition 14.1.2 (Typical Subspace). The $\delta$-typical subspace $T_\delta^{X^n}$ is a subspace of the full
Hilbert space $X_1, \ldots, X_n$ and is associated with many copies of a density operator, such as
the one in (14.2). It is spanned by states $|x^n\rangle^{X^n}$ whose corresponding classical sequences $x^n$
are $\delta$-typical:

$$T_\delta^{X^n} \equiv \mathrm{span}\left\{|x^n\rangle^{X^n} : x^n \in T_\delta^{X^n}\right\}, \tag{14.10}$$

where it is implicit that the typical subspace $T_\delta^{X^n}$ on the LHS is with respect to a density
operator $\rho$ and the typical set $T_\delta^{X^n}$ on the RHS is with respect to the distribution $p_X(x)$ from
the spectral decomposition of $\rho$ in (14.2). We could also denote the typical subspace as $T_{\rho,\delta}^{X^n}$
if we would like to make the dependence of the space on $\rho$ more explicit.

14.1.1 The Typical Subspace Measurement 



The definition of the typical subspace (Definition 14.1.2) gives a way to divide up the Hilbert 
space of n qudits into two subspaces: the typical subspace and the atypical subspace. The 
properties of the typical subspace are similar to what we found for the properties of typical 
sequences. That is, the typical subspace is exponentially smaller than the full Hilbert space 
of n qudits, yet it contains nearly all of the probability (in a sense that we show below). 
The intuition for these properties of the typical subspace is the same as it is classically, as
depicted in Figure 13.2, once we have a spectral decomposition of a density operator.

The typical projector is a projector onto the typical subspace, and the complementary 
projector projects onto the atypical subspace. These projectors play an important opera- 
tional role in quantum Shannon theory because we can construct a quantum measurement 
from them. That is, this measurement is the best way of asking the question, "Is the state 
typical or not?" because it minimally disturbs the state while still retrieving this one bit of 
information. 

Definition 14.1.3 (Typical Projector). Let $\Pi_\delta^{X^n}$ denote the typical projector for the typical
subspace of a density operator $\rho^X$ with spectral decomposition in (14.2). It is a projector
onto the typical subspace:

$$\Pi_\delta^{X^n} \equiv \sum_{x^n \in T_\delta^{X^n}} |x^n\rangle\langle x^n|^{X^n}, \tag{14.11}$$

where it is implicit that the $x^n$ below the summation is a classical sequence in the typical
set $T_\delta^{X^n}$, and the state $|x^n\rangle$ is a quantum state given in (14.9) and associated with the
classical sequence $x^n$ via the spectral decomposition of $\rho$ in (14.2). We can also denote the
typical projector as $\Pi_{\rho,\delta}^{X^n}$ if we would like to make its dependence on $\rho$ explicit.

The action of multiplying the density operator $\rho^{X^n}$ by the typical projector $\Pi_\delta^{X^n}$ is to
select out all the basis states of $\rho^{X^n}$ that are in the typical subspace and form a "sliced"
operator $\tilde{\rho}^{X^n}$ that is close to the original density operator $\rho^{X^n}$:

$$\tilde{\rho}^{X^n} \equiv \Pi_\delta^{X^n} \rho^{X^n} \Pi_\delta^{X^n} = \sum_{x^n \in T_\delta^{X^n}} p_{X^n}(x^n)\, |x^n\rangle\langle x^n|^{X^n}. \tag{14.12}$$

That is, the effect of projecting a state onto the typical subspace $T_\delta^{X^n}$ is to "slice" out any
component of the state $\rho^{X^n}$ that does not lie in the typical subspace $T_\delta^{X^n}$.

Exercise 14.1.1 Show that the typical projector $\Pi_\delta^{X^n}$ commutes with the density operator $\rho^{X^n}$:

$$\rho^{X^n} \Pi_\delta^{X^n} = \Pi_\delta^{X^n} \rho^{X^n}. \tag{14.13}$$

The typical projector allows us to formulate an operational method for delicately asking 
the question: "Is the state typical or not?" We can construct a quantum measurement that 
consists of two outcomes: the outcome "1" reveals that the state is in the typical subspace, 
and "0" reveals that it is not. This typical subspace measurement is often one of the first 
important steps in most protocols in quantum Shannon theory. 

Definition 14.1.4 (Typical Subspace Measurement). The following map is a quantum
instrument (see Section 4.4.7) that realizes the typical subspace measurement:

$$\sigma \rightarrow \left(I - \Pi_\delta^{X^n}\right) \sigma \left(I - \Pi_\delta^{X^n}\right) \otimes |0\rangle\langle 0| + \Pi_\delta^{X^n} \sigma\, \Pi_\delta^{X^n} \otimes |1\rangle\langle 1|, \tag{14.14}$$

where $\sigma$ is some quantum state living in the Hilbert space $X^n$. It associates a classical register
with the outcome of the measurement: the value of the classical register is $|0\rangle$ for the support
of the state $\sigma$ that is not in the typical subspace, and it is equal to $|1\rangle$ for the support of the
state $\sigma$ that is in the typical subspace.

The implementation of a typical subspace measurement is currently far from the reality
of what is experimentally accessible if we would like to have the measure concentration
effects necessary for proving many of the results in quantum Shannon theory. Recall from
Figure 13.1 that we required a sequence of about a million bits in order to have the needed
measure concentration effects. We would need a similar number of qubits emitted from
a quantum information source, and furthermore, we would require the ability to perform
noiseless coherent operations over about a million or more qubits in order to implement
the typical subspace measurement. Such a daunting requirement firmly places quantum
Shannon theory as a "highly theoretical theory," rather than being a theory that can make
close connection to current experimental practice.¹

¹ We should note that this was certainly the case as well for information theory when Claude Shannon
developed it in 1948, but in the many years since then, there has been much progress in the development of
practical classical codes for achieving the classical capacity of a classical channel.
14.1.2 The Difference between the Typical Set and the Typical Subspace

We now offer a simple example to discuss the difference between the classical viewpoint
associated with the typical set and the quantum viewpoint associated with the typical subspace.
Suppose that a quantum information source emits the state $|+\rangle$ with probability 1/2 and it
emits the state $|0\rangle$ with probability 1/2. For the moment, let us ignore the fact that the two
states $|+\rangle$ and $|0\rangle$ are not perfectly distinguishable and instead suppose that they are. Then
it would turn out that nearly every sequence emitted from this source is a typical sequence
because the distribution of the source is uniform. Recall that the typical set has size roughly
equal to $2^{nH(X)}$, and in this case, the entropy of the distribution $(1/2, 1/2)$ is equal to one bit.
Thus the size of the typical set is roughly the same as the size of the set of all sequences for
this distribution because $2^{nH(X)} = 2^n$.

Now let us take into account the fact that the states $|+\rangle$ and $|0\rangle$ are not perfectly
distinguishable and use the prescription given in Definition 14.1.2 for the typical subspace.
The density operator of the above ensemble is as follows:

$$\frac{1}{2}|+\rangle\langle +| + \frac{1}{2}|0\rangle\langle 0| = \begin{bmatrix} 3/4 & 1/4 \\ 1/4 & 1/4 \end{bmatrix}, \tag{14.15}$$

where its matrix representation is with respect to the computational basis. The spectral
decomposition of the density operator is

$$\cos^2(\pi/8)\, |\psi_0\rangle\langle\psi_0| + \sin^2(\pi/8)\, |\psi_1\rangle\langle\psi_1|, \tag{14.16}$$

where the states $|\psi_0\rangle$ and $|\psi_1\rangle$ are orthogonal, and thus distinguishable from one another.
The quantum information source that outputs $|0\rangle$ and $|+\rangle$ with equal probability is thus
equivalent to a source that outputs $|\psi_0\rangle$ with probability $\cos^2(\pi/8)$ and $|\psi_1\rangle$ with probability $\sin^2(\pi/8)$.

We construct the projector onto the typical subspace by taking sums of typical strings
of the states $|\psi_0\rangle\langle\psi_0|$ and $|\psi_1\rangle\langle\psi_1|$ rather than the states $|0\rangle\langle 0|$ and $|+\rangle\langle +|$, where typicality
is with respect to the distribution $(\cos^2(\pi/8), \sin^2(\pi/8))$. The dimension of the typical
subspace corresponding to the quantum information source is far different from the size of
the aforementioned typical set corresponding to the distribution $(1/2, 1/2)$. It is roughly
equal to $2^{0.6n}$ because the entropy of the distribution $(\cos^2(\pi/8), \sin^2(\pi/8))$ is about 0.6 bits.
This stark contrast in the sizes has to do with the non-orthogonality of the states from the
original description of the ensemble. That is, non-orthogonality of states in an ensemble
implies that the size of the typical subspace can potentially be dramatically smaller than the
size of the typical set corresponding to the distribution of the states in the ensemble. This
result of course has implications for the compressibility of quantum information, and we will
discuss these ideas in more detail in Chapter 17. For now, we continue with the technical
details of typical subspaces.
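The numbers in this example can be reproduced with a few lines of arithmetic. The sketch below assumes the matrix representation of the ensemble's density operator in the computational basis, with entries 3/4, 1/4, 1/4, 1/4, and uses the closed-form eigenvalues of a real symmetric 2-by-2 matrix; the function names are illustrative.

```python
import math


def eigenvalues_2x2_symmetric(a, b, d):
    """Eigenvalues of the real symmetric matrix [[a, b], [b, d]], larger one first."""
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mean + radius, mean - radius


def shannon_entropy(probs):
    """Shannon entropy in bits; also the von Neumann entropy when probs are eigenvalues."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Density operator (1/2)|+><+| + (1/2)|0><0| in the computational basis.
lam0, lam1 = eigenvalues_2x2_symmetric(0.75, 0.25, 0.25)

print(lam0, math.cos(math.pi / 8) ** 2)   # the two values agree
print(lam1, math.sin(math.pi / 8) ** 2)   # the two values agree
print(shannon_entropy([0.5, 0.5]))        # entropy of the ensemble's distribution
print(shannon_entropy([lam0, lam1]))      # entropy of the density operator, about 0.6
```

The gap between one bit and roughly 0.6 bits is what separates the exponents of the typical set size and the typical subspace dimension in this example.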
14.1.3 Properties of the Typical Subspace

The typical subspace $T_\delta^{X^n}$ enjoys several useful properties that are "quantized" versions of
the typical sequence properties:

Property 14.1.1 (Unit Probability) Suppose that we perform a typical subspace
measurement of a state $\rho^{X^n}$. Then the probability that the quantum state $\rho^{X^n}$ is in the typical
subspace $T_\delta^{X^n}$ approaches one as $n$ becomes large:

$$\forall \epsilon > 0 \quad \mathrm{Tr}\left\{\Pi_\delta^{X^n} \rho^{X^n}\right\} \geq 1 - \epsilon \quad \text{for sufficiently large } n, \tag{14.17}$$

where $\Pi_\delta^{X^n}$ is the typical subspace projector from Definition 14.1.3.

Property 14.1.2 (Exponentially Small Dimension) The dimension $\dim(T_\delta^{X^n})$ of the
$\delta$-typical subspace is exponentially smaller than the dimension $|\mathcal{X}|^n$ of the entire space of
quantum states when the output of the quantum information source is not maximally mixed.
We formally state this property as follows:

$$\mathrm{Tr}\left\{\Pi_\delta^{X^n}\right\} \leq 2^{n(H(X)+c\delta)}, \tag{14.18}$$

where $c$ is some constant that depends on whether we employ the weak or strong notion of
typicality. We can also lower bound the dimension $\dim(T_\delta^{X^n})$ of the $\delta$-typical subspace when
$n$ is sufficiently large:

$$\forall \epsilon > 0 \quad \mathrm{Tr}\left\{\Pi_\delta^{X^n}\right\} \geq (1 - \epsilon)\, 2^{n(H(X)-c\delta)} \quad \text{for sufficiently large } n. \tag{14.19}$$

Property 14.1.3 (Equipartition) The operator $\Pi_\delta^{X^n} \rho^{X^n} \Pi_\delta^{X^n}$ corresponds to a "slicing"
of the density operator $\rho^{X^n}$ where we slice out and keep only the part with support in the
typical subspace. We can then bound all of the eigenvalues of the sliced operator $\Pi_\delta^{X^n} \rho^{X^n} \Pi_\delta^{X^n}$
as follows:

$$2^{-n(H(X)+c\delta)}\, \Pi_\delta^{X^n} \leq \Pi_\delta^{X^n} \rho^{X^n} \Pi_\delta^{X^n} \leq 2^{-n(H(X)-c\delta)}\, \Pi_\delta^{X^n}. \tag{14.20}$$

The above inequality is an operator inequality. It is a statement about the eigenvalues of
the operators $\Pi_\delta^{X^n} \rho^{X^n} \Pi_\delta^{X^n}$ and $\Pi_\delta^{X^n}$, and these operators have the same eigenvectors because
they commute. Therefore, the above inequality is equivalent to the following inequality that
applies in the classical case:

$$\forall x^n \in T_\delta^{X^n} : \quad 2^{-n(H(X)+c\delta)} \leq p_{X^n}(x^n) \leq 2^{-n(H(X)-c\delta)}. \tag{14.21}$$

This equivalence holds because each probability $p_{X^n}(x^n)$ is an eigenvalue of $\Pi_\delta^{X^n} \rho^{X^n} \Pi_\delta^{X^n}$.

The dimension $\dim(T_\delta^{X^n})$ of the $\delta$-typical subspace is approximately equal to the dimension
$|\mathcal{X}|^n$ of the entire space only when the density operator of the quantum information
source is maximally mixed because

$$\mathrm{Tr}\left\{\Pi_\delta^{X^n}\right\} \leq |\mathcal{X}|^n \cdot 2^{n\delta} \simeq |\mathcal{X}|^n. \tag{14.22}$$

The proofs of the above properties are essentially identical to those from the classical
case in Sections 13.7.3 and 13.9.3, regardless of whether we employ a weak or strong notion
of quantum typicality. We leave the proofs as the three exercises below.
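A numerical illustration of the first two properties is possible for a qubit source, because weak typicality of a basis state then depends only on how many times each eigenvector appears in it. The sketch below assumes weak typicality with $c = 1$ and a source with two strictly positive eigenvalues; the function name and the parameter choices are illustrative, and the code is a sanity check rather than a proof.

```python
import math


def typical_subspace_stats(eigenvalues, n, delta):
    """Return (H, dim, prob) for the weakly delta-typical subspace of a qubit source.

    dim  is Tr{Pi}, the exact number of typical basis states |x^n>.
    prob is Tr{Pi rho^{(x)n}}, the total eigenvalue mass on those states.
    """
    p0, p1 = eigenvalues
    entropy = -(p0 * math.log2(p0) + p1 * math.log2(p1))
    dimension = 0
    probability = 0.0
    for k in range(n + 1):
        # All C(n, k) basis states with k occurrences of the second eigenvector
        # share the eigenvalue p0^(n-k) * p1^k; work with its base-2 logarithm.
        log2_eigenvalue = (n - k) * math.log2(p0) + k * math.log2(p1)
        sample_entropy = -log2_eigenvalue / n
        if abs(sample_entropy - entropy) <= delta:
            count = math.comb(n, k)
            dimension += count
            probability += 2.0 ** (math.log2(count) + log2_eigenvalue)
    return entropy, dimension, probability


# Hypothetical parameters: the source from Section 14.1.2, n = 200, delta = 0.1.
p0 = math.cos(math.pi / 8) ** 2
H, dim, prob = typical_subspace_stats((p0, 1 - p0), 200, 0.1)
print(prob)                                 # probability mass in the typical subspace
print(math.log2(dim), 200 * (H + 0.1))      # log-dimension versus the upper-bound exponent
```

For these parameters the typical subspace already carries most of the probability while its dimension is a vanishing fraction of $2^{200}$; increasing $n$ at fixed $\delta$ pushes the probability closer to one.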
Exercise 14.1.2 Prove the Unit Probability Property of the $\delta$-typical subspace
(Property 14.1.1). First show that the probability that many copies of a density operator is
in the $\delta$-typical subspace is equal to the probability that a random sequence is $\delta$-typical:

$$\mathrm{Tr}\left\{\Pi_\delta^{X^n} \rho^{X^n}\right\} = \Pr\left\{X^n \in T_\delta^{X^n}\right\}. \tag{14.23}$$

Exercise 14.1.3 Prove the Exponentially Small Dimension Property of the $\delta$-typical
subspace (Property 14.1.2). First show that the trace of the typical projector $\Pi_\delta^{X^n}$ is equal to
the dimension of the typical subspace $T_\delta^{X^n}$:

$$\dim\left(T_\delta^{X^n}\right) = \mathrm{Tr}\left\{\Pi_\delta^{X^n}\right\}. \tag{14.24}$$

Then prove the property.

Exercise 14.1.4 Prove the Equipartition Property of the $\delta$-typical subspace (Property 14.1.3).
First show that

$$\Pi_\delta^{X^n} \rho^{X^n} \Pi_\delta^{X^n} = \sum_{x^n \in T_\delta^{X^n}} p_{X^n}(x^n)\, |x^n\rangle\langle x^n|^{X^n}, \tag{14.25}$$

and then argue the proof.

The result of the following exercise shows that the sliced operator $\tilde{\rho}^{X^n} \equiv \Pi_\delta^{X^n} \rho^{X^n} \Pi_\delta^{X^n}$ is
a good approximation to the original state $\rho^{X^n}$ in the limit of many copies of the states, and
it effectively gives a scheme for quantum data compression (more on this in Chapter 17).



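Exercise 14.1.5 Use the Gentle Operator Lemma (Lemma 9.4.2) to show that $\rho^{X^n}$ is $2\sqrt{\epsilon}$-close
to the sliced operator $\tilde{\rho}^{X^n}$ when $n$ is large:

$$\left\| \rho^{X^n} - \tilde{\rho}^{X^n} \right\|_1 \leq 2\sqrt{\epsilon}. \tag{14.26}$$

Use the Gentle Measurement Lemma (Lemma 9.4.1) to show that the sliced state

$$\left[\mathrm{Tr}\left\{\Pi_\delta^{X^n} \rho^{X^n}\right\}\right]^{-1} \tilde{\rho}^{X^n} \tag{14.27}$$

is $2\sqrt{\epsilon}$-close in trace distance to $\rho^{X^n}$.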
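A small numerical check of these two statements is possible because $\rho^{X^n}$ and its sliced versions are all diagonal in the same eigenbasis, so each trace norm reduces to a sum of absolute differences of eigenvalues. The sketch below assumes weak typicality, takes $\epsilon$ to be $1 - \mathrm{Tr}\{\Pi_\delta^{X^n}\rho^{X^n}\}$, and enumerates all sequences, so it is only practical for small $n$; the function name and parameters are illustrative, and the code is not a substitute for the exercise.

```python
import math
from itertools import product


def sliced_state_trace_distances(eigenvalues, n, delta):
    """Return (epsilon, ||rho - rho_tilde||_1, ||rho - rho_tilde / Tr{Pi rho}||_1).

    Assumes strictly positive eigenvalues and a non-empty typical set.
    """
    entropy = -sum(p * math.log2(p) for p in eigenvalues)
    spectrum = []  # (eigenvalue of the n-copy state, is_typical) for every sequence x^n
    for seq in product(range(len(eigenvalues)), repeat=n):
        log2_p = sum(math.log2(eigenvalues[x]) for x in seq)
        is_typical = abs(-log2_p / n - entropy) <= delta
        spectrum.append((2.0 ** log2_p, is_typical))

    typical_mass = sum(p for p, typical in spectrum if typical)  # Tr{Pi rho}
    epsilon = 1.0 - typical_mass
    # Commuting operators: the trace norm is the sum of |eigenvalue differences|.
    dist_unnormalized = sum(
        abs(p - (p if typical else 0.0)) for p, typical in spectrum)
    dist_normalized = sum(
        abs(p - (p / typical_mass if typical else 0.0)) for p, typical in spectrum)
    return epsilon, dist_unnormalized, dist_normalized


# Hypothetical parameters: the source from Section 14.1.2 with n = 12, delta = 0.2.
p0 = math.cos(math.pi / 8) ** 2
eps, d_unnorm, d_norm = sliced_state_trace_distances([p0, 1 - p0], 12, 0.2)
print(eps, d_unnorm, d_norm, 2 * math.sqrt(eps))
```

For such a small $n$ the value of $\epsilon$ is far from negligible, so the bounds are loose; the point is only that both distances stay below $2\sqrt{\epsilon}$ and shrink together with $\epsilon$.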

Exercise 14.1.6 Show that the purity $\mathrm{Tr}\{(\tilde{\rho}^{X^n})^2\}$ of the sliced state $\tilde{\rho}^{X^n}$ satisfies the
following bound for sufficiently large $n$ and any $\epsilon > 0$ (use weak quantum typicality):

$$(1 - \epsilon)\, 2^{-n(H(X)+\delta)} \leq \mathrm{Tr}\left\{\left(\tilde{\rho}^{X^n}\right)^2\right\} \leq 2^{-n(H(X)-\delta)}. \tag{14.28}$$

Exercise 14.1.7 Show that the following bounds hold for the zero-norm and the $\infty$-norm
of the sliced state $\tilde{\rho}^{X^n}$ for any $\epsilon > 0$ and sufficiently large $n$:

\begin{align}
(1 - \epsilon)\, 2^{n(H(X)-\delta)} & \leq \left\| \tilde{\rho}^{X^n} \right\|_0 \leq 2^{n(H(X)+\delta)}, \tag{14.29} \\
2^{-n(H(X)+\delta)} & \leq \left\| \tilde{\rho}^{X^n} \right\|_\infty \leq 2^{-n(H(X)-\delta)}. \tag{14.30}
\end{align}

(Recall that the zero-norm of an operator is equal to the size of its support and that the
infinity norm is equal to its maximum eigenvalue. Again use weak quantum typicality.)
14.1.4 The Typical Subspace for Bipartite or Multipartite States

Recall from Section 13.5 that two classical sequences $x^n$ and $y^n$ are weak jointly typical if
the joint sample entropy of $x^n y^n$ is close to the joint entropy $H(X,Y)$ and if the sample
entropies of the individual sequences are close to their respective marginal entropies $H(X)$
and $H(Y)$ (where the entropies are with respect to some joint distribution $p_{X,Y}(x,y)$). How
would we then actually check that these conditions hold? The most obvious way is simply
to look at the sequence $x^n y^n$, compute its joint sample entropy, compare this quantity to
the true joint entropy, determine if the difference is under the threshold $\delta$, and do the same
for the marginal sequences. These two operations commute in the sense that we can
determine first if the marginals are typical and then if the joint sequence is typical or vice
versa without any difference in which one we do first.

But such a commutation does not necessarily hold in the quantum world. The way
that we determine whether a quantum state is typical is by performing a typical subspace
measurement. If we perform a typical subspace measurement of the whole system followed
by such a measurement on the marginals, the resulting state is not necessarily the same as
if we performed the marginal measurements followed by the joint measurements. For this
reason, the notion of weak joint typicality as given in Definition 13.5.3 does not really exist in
general for the quantum case. Nevertheless, we still briefly overview how one would handle
such a case and later give an example of a restricted class of states for which weak joint
typicality holds.
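The failure of commutation can be seen already with two qubits. The sketch below is a toy illustration and not an actual pair of typical projectors: as an assumption, the projector onto a maximally entangled state stands in for a "joint" projector and the projector $|0\rangle\langle 0| \otimes I$ stands in for a "marginal" one. Matrices are written in the basis $|00\rangle, |01\rangle, |10\rangle, |11\rangle$.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Projector onto the entangled state (|00> + |11>)/sqrt(2): toy "joint" projector.
bell_projector = [
    [0.5, 0, 0, 0.5],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0.5, 0, 0, 0.5],
]

# Projector |0><0| (x) I on the first qubit: toy "marginal" projector.
local_projector = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

# The rightmost factor acts first on a state vector.
joint_then_local = matmul(local_projector, bell_projector)
local_then_joint = matmul(bell_projector, local_projector)

print(joint_then_local)
print(local_then_joint)
print(joint_then_local == local_then_joint)   # the two orders give different operators
```

Because the two products differ, the post-measurement state depends on which projection is applied first, which is the obstacle described above for defining weak joint typicality in general.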

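Suppose that we have a quantum system in the mixed state $\rho^{XY}$ shared between two
parties $X$ and $Y$. We can decompose the mixed state with the spectral theorem:

$$\rho^{XY} = \sum_{z \in \mathcal{Z}} p_Z(z)\, |\psi_z\rangle\langle\psi_z|^{XY}, \tag{14.31}$$

where the states $\{|\psi_z\rangle^{XY}\}_{z \in \mathcal{Z}}$ form an orthonormal basis for the joint quantum system $XY$
and each of the states $|\psi_z\rangle^{XY}$ can be entangled in general.

We can consider the $n$th extension $\rho^{X^n Y^n}$ of the above state and abbreviate its spectral
decomposition as follows:

$$\rho^{X^n Y^n} \equiv \left(\rho^{XY}\right)^{\otimes n} = \sum_{z^n \in \mathcal{Z}^n} p_{Z^n}(z^n)\, |\psi_{z^n}\rangle\langle\psi_{z^n}|^{X^n Y^n}, \tag{14.32}$$

where

\begin{align}
p_{Z^n}(z^n) & \equiv p_Z(z_1) \cdots p_Z(z_n), \tag{14.33} \\
|\psi_{z^n}\rangle^{X^n Y^n} & \equiv |\psi_{z_1}\rangle^{X_1 Y_1} \cdots |\psi_{z_n}\rangle^{X_n Y_n}. \tag{14.34}
\end{align}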

This development immediately leads to the definition of the typical subspace for a bipartite 
state. 

Definition 14.1.5 (Typical Subspace of a Bipartite State). The $\delta$-typical subspace $T_\delta^{X^n Y^n}$
of $\rho^{XY}$ is the space spanned by states $|\psi_{z^n}\rangle^{X^n Y^n}$ whose corresponding classical sequence $z^n$ is
in the typical set $T_\delta^{Z^n}$:

$$T_\delta^{X^n Y^n} \equiv \mathrm{span}\left\{|\psi_{z^n}\rangle^{X^n Y^n} : z^n \in T_\delta^{Z^n}\right\}. \tag{14.35}$$

The states $|\psi_{z^n}\rangle^{X^n Y^n}$ are from the spectral decomposition of $\rho^{XY}$, and the distribution to
consider for typicality of the classical sequence $z^n$ is $p_Z(z)$ from the spectral decomposition.

Definition 14.1.6 (Typical Projector of a Bipartite State). Let $\Pi_\delta^{X^n Y^n}$ denote the projector
onto the typical subspace of $\rho^{XY}$:

$$\Pi_\delta^{X^n Y^n} \equiv \sum_{z^n \in T_\delta^{Z^n}} |\psi_{z^n}\rangle\langle\psi_{z^n}|^{X^n Y^n}. \tag{14.36}$$

Thus, there is ultimately no difference between the typical subspace for a bipartite state
and the typical subspace for a single-party state because the spectral decomposition gives a
way for determining the typical subspace and the typical projector in both cases. Perhaps
the only difference is a cosmetic one because $XY$ denotes the bipartite system while $Z$
indicates a random variable with a distribution given from the spectral decomposition. Finally,
Properties 14.1.1 through 14.1.3 hold for quantum typicality of a bipartite state.



14.1.5 The Jointly Typical Subspace for Classical States 

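The notion of weak joint typicality may not hold in the general case, but it does hold for a
special class of states that are completely classical. Suppose now that the mixed state $\rho^{XY}$
shared between two parties $X$ and $Y$ has the following special form:

\begin{align}
\rho^{XY} & \equiv \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x,y)\, (|x\rangle \otimes |y\rangle)(\langle x| \otimes \langle y|)^{XY} \tag{14.37} \\
& = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x,y)\, |x\rangle\langle x|^X \otimes |y\rangle\langle y|^Y, \tag{14.38}
\end{align}

where the states $\{|x\rangle^X\}_{x \in \mathcal{X}}$ and $\{|y\rangle^Y\}_{y \in \mathcal{Y}}$ form an orthonormal basis for the respective
systems $X$ and $Y$. This state has only classical correlations because Alice and Bob can
prepare it simply by local operations and classical communication. That is, Alice can sample
from the distribution $p_{X,Y}(x,y)$ in her laboratory and send Bob the variable $y$. Furthermore,
the states on $X$ and $Y$ locally form a distinguishable set.

We can consider the $n$th extension $\rho^{X^n Y^n}$ of the above state:

\begin{align}
\rho^{X^n Y^n} & \equiv \left(\rho^{XY}\right)^{\otimes n} \tag{14.39} \\
& = \sum_{x^n \in \mathcal{X}^n,\, y^n \in \mathcal{Y}^n} p_{X^n,Y^n}(x^n,y^n)\, (|x^n\rangle \otimes |y^n\rangle)(\langle x^n| \otimes \langle y^n|)^{X^n Y^n} \tag{14.40} \\
& = \sum_{x^n \in \mathcal{X}^n,\, y^n \in \mathcal{Y}^n} p_{X^n,Y^n}(x^n,y^n)\, |x^n\rangle\langle x^n|^{X^n} \otimes |y^n\rangle\langle y^n|^{Y^n}. \tag{14.41}
\end{align}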

This development immediately leads to the definition of the weak jointly typical subspace 
for this special case. 
Definition 14.1.7 (Jointly Typical Subspace). The weak $\delta$-jointly-typical subspace $T_\delta^{X^n Y^n}$
is the space spanned by states $|x^n\rangle|y^n\rangle^{X^n Y^n}$ whose corresponding classical sequence $x^n y^n$ is
in the jointly-typical set:

$$T_\delta^{X^n Y^n} \equiv \mathrm{span}\left\{|x^n\rangle|y^n\rangle^{X^n Y^n} : x^n y^n \in T_\delta^{X^n Y^n}\right\}. \tag{14.42}$$

Definition 14.1.8 (Jointly Typical Projector). Let $\Pi_\delta^{X^n Y^n}$ denote the jointly-typical
projector. It is the projector onto the jointly-typical subspace:

$$\Pi_\delta^{X^n Y^n} \equiv \sum_{x^n y^n \in T_\delta^{X^n Y^n}} |x^n\rangle\langle x^n|^{X^n} \otimes |y^n\rangle\langle y^n|^{Y^n}. \tag{14.43}$$



Properties of the Jointly Typical Projector 



Properties 14.1.1 through 14.1.3 apply to the jointly-typical subspace $T_\delta^{X^nY^n}$ because it is a typical subspace. The following property, analogous to Property 13.5.4 for classical joint typicality, holds because the state $\rho^{XY}$ has the special form in (14.37):



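Property 14.1.4 (Probability of Joint Typicality) Consider the following marginal density operators:

\begin{equation}
\rho^{X^n} \equiv \mathrm{Tr}_{Y^n}\{\rho^{X^nY^n}\}, \qquad \rho^{Y^n} \equiv \mathrm{Tr}_{X^n}\{\rho^{X^nY^n}\}. \tag{14.44}
\end{equation}

Let us define $\tilde{\rho}^{X^nY^n}$ as the following density operator:

\begin{equation}
\tilde{\rho}^{X^nY^n} \equiv \rho^{X^n} \otimes \rho^{Y^n} \neq \rho^{X^nY^n}. \tag{14.45}
\end{equation}

The marginal density operators of $\tilde{\rho}^{X^nY^n}$ are therefore equivalent to the marginal density operators of $\rho^{X^nY^n}$. Then we can bound the probability that the state $\tilde{\rho}^{X^nY^n}$ lies in the typical subspace $T_\delta^{X^nY^n}$:

\begin{equation}
\mathrm{Tr}\{\Pi_\delta^{X^nY^n} \tilde{\rho}^{X^nY^n}\} \leq 2^{-n(I(X;Y) - 3\delta)}. \tag{14.46}
\end{equation}

Exercise 14.1.8 Prove the bound in Property 14.1.4:

\begin{equation}
\mathrm{Tr}\{\Pi_\delta^{X^nY^n} \tilde{\rho}^{X^nY^n}\} \leq 2^{-n(I(X;Y) - 3\delta)}. \tag{14.47}
\end{equation}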
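Because the state in (14.37) is classical, the bound in Property 14.1.4 can be checked by brute force for a small example. The sketch below is only an illustration and not a proof: it assumes the usual definition of weak joint typicality (the sample entropies of $x^n$, $y^n$, and the pair are each $\delta$-close to the true entropies), and the function names and the joint distribution `p_xy` are hypothetical choices with all entries positive.

```python
import itertools
import math


def entropy(dist):
    """Shannon entropy in bits of a dict {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def sample_entropy(seq, dist):
    """Sample entropy -(1/n) log2 p(seq) of a sequence under an IID distribution."""
    return -sum(math.log2(dist[s]) for s in seq) / len(seq)


def product_state_typical_probability(p_xy, n, delta):
    """Return (Tr{Pi rho_tilde}, bound) for a classical state with joint pmf p_xy."""
    xs = sorted({x for x, _ in p_xy})
    ys = sorted({y for _, y in p_xy})
    p_x = {x: sum(p_xy[(x, y)] for y in ys) for x in xs}
    p_y = {y: sum(p_xy[(x, y)] for x in xs) for y in ys}
    h_x, h_y, h_xy = entropy(p_x), entropy(p_y), entropy(p_xy)
    prob = 0.0
    for xn in itertools.product(xs, repeat=n):
        if abs(sample_entropy(xn, p_x) - h_x) > delta:
            continue
        for yn in itertools.product(ys, repeat=n):
            if abs(sample_entropy(yn, p_y) - h_y) > delta:
                continue
            if abs(sample_entropy(list(zip(xn, yn)), p_xy) - h_xy) <= delta:
                # Weight under the product of the marginals, not the joint pmf.
                prob += math.prod(p_x[x] for x in xn) * math.prod(p_y[y] for y in yn)
    # Right-hand side of (14.46): 2^{-n(I(X;Y) - 3 delta)}.
    bound = 2.0 ** (-n * (h_x + h_y - h_xy - 3 * delta))
    return prob, bound


# Hypothetical example: two correlated bits that agree with probability 0.8.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
prob, bound = product_state_typical_probability(p_xy, n=5, delta=0.05)
print(prob, bound)
```

The enumeration is exponential in $n$, so it is only practical for very short sequences; it is meant to make the roles of the product state and the jointly typical projector concrete.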



14.2 Conditional Quantum Typicality 

The notion of conditional quantum typicality is somewhat similar to the notion of conditional typicality in the classical domain, but we again quickly notice some departures because different quantum states do not have to be perfectly distinguishable. The technical tools for conditional quantum typicality developed in this section are important for determining how much public or private classical information we can send over a quantum channel (topics discussed in Chapters 19 and 22).



We first develop the notion of a conditional quantum information source. Consider a random variable $X$ with probability distribution $p_X(x)$. Let $\mathcal{X}$ be the alphabet of the random variable, and let $|\mathcal{X}|$ denote its cardinality. We also associate a quantum system $X$ with the random variable $X$ and use an orthonormal set $\{|x\rangle\}_{x \in \mathcal{X}}$ to represent its realizations. We again label the elements of the alphabet $\mathcal{X}$ as $\{x\}_{x \in \mathcal{X}}$.

Suppose we generate a realization $x$ of random variable $X$ according to its distribution $p_X(x)$, and we follow by generating a random quantum state according to some conditional distribution. This procedure then gives us a set of $|\mathcal{X}|$ quantum information sources (each of them is as in Definition 14.1.1). We index them by the classical index $x$, and the quantum information source has expected density operator $\rho_x^B$ if the emitted classical index is $x$. Furthermore, we impose the constraint that each $\rho_x^B$ has the same dimension. This quantum information source is therefore a "conditional quantum information source." Let $\mathcal{H}_B$ and $B$ denote the respective Hilbert space and system label corresponding to the quantum output of the conditional quantum information source. Let us call the resulting ensemble the "classical-quantum ensemble" and say that a "classical-quantum information source" generates it. The classical-quantum ensemble is as follows:

\begin{equation}
\left\{ p_X(x), \; |x\rangle\langle x|^{X} \otimes \rho_x^{B} \right\}_{x \in \mathcal{X}}, \tag{14.48}
\end{equation}

where we correlate the classical state $|x\rangle$ with the density operator $\rho_x^B$ of the conditional quantum information source. The expected density operator of the above classical-quantum ensemble is the following classical-quantum state (discussed in Section 4.3.4):

\begin{equation}
\rho^{XB} \equiv \sum_{x \in \mathcal{X}} p_X(x) \, |x\rangle\langle x|^{X} \otimes \rho_x^{B}. \tag{14.49}
\end{equation}

The conditional quantum entropy $H(B|X)$ of the classical-quantum state $\rho^{XB}$ is as follows:

\begin{equation}
H(B|X) = \sum_{x \in \mathcal{X}} p_X(x) H(B|X = x) = \sum_{x \in \mathcal{X}} p_X(x) H(\rho_x). \tag{14.50}
\end{equation}

We can write the spectral decomposition of each conditional density operator $\rho_x^B$ as follows:

\begin{equation}
\rho_x^{B} = \sum_{y \in \mathcal{Y}} p_{Y|X}(y|x) \, |y_x\rangle\langle y_x|^{B}, \tag{14.51}
\end{equation}

where the elements of the set $\{y\}_{y \in \mathcal{Y}}$ label the elements of an alphabet $\mathcal{Y}$, the orthonormal set $\{|y_x\rangle\}_{y \in \mathcal{Y}}$ is the set of eigenvectors of $\rho_x^B$, and the corresponding eigenvalues are $\{p_{Y|X}(y|x)\}_{y \in \mathcal{Y}}$. We need the $x$ label for the orthonormal set $\{|y_x\rangle\}_{y \in \mathcal{Y}}$ because the decomposition may be different for different density operators $\rho_x^B$. The above notation is again suggestive because the eigenvalues $p_{Y|X}(y|x)$ correspond to conditional probabilities, and the set $\{|y_x\rangle\}$ of eigenvectors corresponds to an orthonormal set of quantum states conditioned on label $x$. With this representation, the conditional entropy $H(B|X)$ reduces to a formula that looks like that for the classical conditional entropy:

\begin{align}
H(B|X) &= \sum_{x \in \mathcal{X}} p_X(x) H(\rho_x^{B}) \tag{14.52} \\
&= \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p_X(x) \, p_{Y|X}(y|x) \log \frac{1}{p_{Y|X}(y|x)}. \tag{14.53}
\end{align}
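Formula (14.53) only needs the distribution $p_X(x)$ and the eigenvalues $p_{Y|X}(y|x)$ of each conditional density operator. The following minimal sketch evaluates it for a hypothetical two-state ensemble; the function name `conditional_entropy` and the example spectra are assumptions made for illustration, and logarithms are taken base two.

```python
import math


def conditional_entropy(p_x, spectra):
    """H(B|X) in bits, computed from (14.53).

    p_x maps each classical symbol x to p_X(x); spectra maps x to the list of
    eigenvalues p_{Y|X}(y|x) of the conditional density operator rho_x.
    """
    total = 0.0
    for x, px in p_x.items():
        # Zero eigenvalues contribute nothing (0 log(1/0) is taken to be 0).
        total += px * sum(q * math.log2(1.0 / q) for q in spectra[x] if q > 0)
    return total


# Hypothetical ensemble: a pure qubit state and a maximally mixed qubit state.
p_x = {0: 0.5, 1: 0.5}
spectra = {0: [1.0, 0.0], 1: [0.5, 0.5]}
h_b_given_x = conditional_entropy(p_x, spectra)
print(h_b_given_x)
```

Note that the eigenvectors $|y_x\rangle$ never enter the computation, which is why the quantity looks exactly like a classical conditional entropy.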

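We now consider when the classical-quantum information source emits a large number $n$ of states. The density operator for the output state $\rho^{X^nB^n}$ is as follows:

\begin{align}
\rho^{X^nB^n} &\equiv (\rho^{XB})^{\otimes n} \tag{14.54} \\
&= \left( \sum_{x_1 \in \mathcal{X}} p_X(x_1) |x_1\rangle\langle x_1|^{X_1} \otimes \rho_{x_1}^{B_1} \right) \otimes \cdots \otimes \left( \sum_{x_n \in \mathcal{X}} p_X(x_n) |x_n\rangle\langle x_n|^{X_n} \otimes \rho_{x_n}^{B_n} \right) \tag{14.55} \\
&= \sum_{x_1,\ldots,x_n \in \mathcal{X}} p_X(x_1) \cdots p_X(x_n) \, |x_1\rangle \cdots |x_n\rangle\langle x_1| \cdots \langle x_n|^{X^n} \otimes \left( \rho_{x_1}^{B_1} \otimes \cdots \otimes \rho_{x_n}^{B_n} \right). \tag{14.56}
\end{align}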

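We can abbreviate the above state as

\begin{equation}
\sum_{x^n \in \mathcal{X}^n} p_{X^n}(x^n) \, |x^n\rangle\langle x^n|^{X^n} \otimes \rho_{x^n}^{B^n}, \tag{14.57}
\end{equation}

where

\begin{align}
p_{X^n}(x^n) &\equiv p_X(x_1) \cdots p_X(x_n), \tag{14.58} \\
|x^n\rangle^{X^n} &\equiv |x_1\rangle^{X_1} \cdots |x_n\rangle^{X_n}, \qquad \rho_{x^n}^{B^n} \equiv \rho_{x_1}^{B_1} \otimes \cdots \otimes \rho_{x_n}^{B_n}, \tag{14.59}
\end{align}

and the spectral decomposition for the state $\rho_{x^n}^{B^n}$ is

\begin{equation}
\rho_{x^n}^{B^n} = \sum_{y^n \in \mathcal{Y}^n} p_{Y^n|X^n}(y^n|x^n) \, |y^n_{x^n}\rangle\langle y^n_{x^n}|^{B^n}, \tag{14.60}
\end{equation}

where

\begin{align}
p_{Y^n|X^n}(y^n|x^n) &\equiv p_{Y_1|X_1}(y_1|x_1) \cdots p_{Y_n|X_n}(y_n|x_n), \tag{14.61} \\
|y^n_{x^n}\rangle^{B^n} &\equiv |y_{1 x_1}\rangle^{B_1} \cdots |y_{n x_n}\rangle^{B_n}. \tag{14.62}
\end{align}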

The above developments are a step along the way toward formulating the definitions of weak and strong conditional quantum typicality.

14.2.1 Weak Conditional Quantum Typicality

We can "quantize" the notion of weak classical conditional typicality so that it applies to a 
classical-quantum information source. 

Definition 14.2.1 (Weak Conditionally Typical Subspace). The conditionally typical subspace $T_\delta^{B^n|x^n}$ corresponds to a particular sequence $x^n$ and an ensemble $\{p_X(x), \rho_x^B\}$. It is the subspace spanned by the states $|y^n_{x^n}\rangle^{B^n}$ whose conditional sample entropy is $\delta$-close to the true conditional quantum entropy:

\begin{equation}
T_\delta^{B^n|x^n} \equiv \mathrm{span}\left\{ |y^n_{x^n}\rangle^{B^n} : \left| \overline{H}(y^n|x^n) - H(B|X) \right| \leq \delta \right\}, \tag{14.63}
\end{equation}

where the states $|y^n_{x^n}\rangle^{B^n}$ are formed from the eigenstates of the density operators $\rho_x^B$ (they are of the form in (14.62)) and the sample entropy is with respect to the distribution $p_{Y|X}(y|x)$ from (14.51).



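Definition 14.2.2 (Weak Conditionally Typical Projector). The projector $\Pi_\delta^{B^n|x^n}$ onto the conditionally typical subspace $T_\delta^{B^n|x^n}$ is as follows:

\begin{equation}
\Pi_\delta^{B^n|x^n} \equiv \sum_{|y^n_{x^n}\rangle \in T_\delta^{B^n|x^n}} |y^n_{x^n}\rangle\langle y^n_{x^n}|^{B^n}. \tag{14.64}
\end{equation}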



14.2.2 Properties of the Weak Conditionally Typical Subspace 

The weak conditionally typical subspace $T_\delta^{B^n|x^n}$ enjoys several useful properties that are "quantized" versions of the properties for weak conditionally typical sequences discussed in Section 13.6. We should point out that we cannot really say much for several of the properties for a particular sequence $x^n$, but we can do so on average for a random sequence $X^n$. Thus, several of the properties give expected behavior for a random sequence $X^n$. This convention for quantum weak conditional typicality is the same as we had for classical weak conditional typicality in Section 13.6.

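Property 14.2.1 (Unit Probability) The expectation of the probability that we measure a random quantum state $\rho_{X^n}^{B^n}$ to be in the conditionally typical subspace $T_\delta^{B^n|X^n}$ approaches one as $n$ becomes large:

\begin{equation}
\forall \epsilon > 0 \quad \mathbb{E}_{X^n}\left\{ \mathrm{Tr}\{\Pi_\delta^{B^n|X^n} \rho_{X^n}^{B^n}\} \right\} \geq 1 - \epsilon \quad \text{for sufficiently large } n. \tag{14.65}
\end{equation}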

Property 14.2.2 (Exponentially Small Dimension) The dimension $\dim(T_\delta^{B^n|x^n})$ of the $\delta$-conditionally typical subspace is exponentially smaller than the dimension $|\mathcal{Y}|^n$ of the entire space of quantum states for most classical-quantum sources. We formally state this property as follows:

\begin{equation}
\mathrm{Tr}\{\Pi_\delta^{B^n|x^n}\} \leq 2^{n(H(B|X) + \delta)}. \tag{14.66}
\end{equation}

We can also lower bound the dimension $\dim(T_\delta^{B^n|x^n})$ of the $\delta$-conditionally typical subspace when $n$ is sufficiently large:

\begin{equation}
\forall \epsilon > 0 \quad \mathbb{E}_{X^n}\left\{ \mathrm{Tr}\{\Pi_\delta^{B^n|X^n}\} \right\} \geq (1 - \epsilon) \, 2^{n(H(B|X) - \delta)} \quad \text{for sufficiently large } n. \tag{14.67}
\end{equation}

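Property 14.2.3 (Equipartition) The density operator $\rho_{x^n}^{B^n}$ looks approximately maximally mixed when projected to the conditionally typical subspace:

\begin{equation}
2^{-n(H(B|X) + \delta)} \, \Pi_\delta^{B^n|x^n} \leq \Pi_\delta^{B^n|x^n} \rho_{x^n}^{B^n} \Pi_\delta^{B^n|x^n} \leq 2^{-n(H(B|X) - \delta)} \, \Pi_\delta^{B^n|x^n}. \tag{14.68}
\end{equation}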

Exercise 14.2.1 Prove all three of the above properties for weak conditional quantum typicality.
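As a sanity check on the upper bound (14.66), the dimension of the weak conditionally typical subspace for a fixed sequence $x^n$ can be counted directly, since it equals the number of eigenvalue strings $y^n$ that satisfy the condition in (14.63). The sketch below is a brute-force illustration under assumed inputs: the function name, the sequence `x_seq`, and the spectra are hypothetical, and logarithms are base two.

```python
import itertools
import math


def weak_conditionally_typical_dimension(x_seq, p_x, spectra, delta):
    """Count eigenvector strings y^n with |sample entropy - H(B|X)| <= delta.

    spectra[x] lists the eigenvalues p_{Y|X}(y|x) of rho_x.
    Returns (dimension, H(B|X)).
    """
    h_bx = sum(
        px * sum(q * math.log2(1.0 / q) for q in spectra[x] if q > 0)
        for x, px in p_x.items()
    )
    n = len(x_seq)
    dimension = 0
    index_ranges = [range(len(spectra[x])) for x in x_seq]
    for y_seq in itertools.product(*index_ranges):
        probs = [spectra[x][y] for x, y in zip(x_seq, y_seq)]
        if min(probs) == 0.0:
            continue  # infinite sample entropy, never typical
        sample = -sum(math.log2(q) for q in probs) / n
        if abs(sample - h_bx) <= delta:
            dimension += 1
    return dimension, h_bx


# Hypothetical qubit ensemble and a length-8 classical sequence.
p_x = {0: 0.5, 1: 0.5}
spectra = {0: [0.9, 0.1], 1: [0.6, 0.4]}
x_seq = (0, 1, 0, 1, 0, 1, 0, 1)
delta = 0.1
dim_typical, h_bx = weak_conditionally_typical_dimension(x_seq, p_x, spectra, delta)
print(dim_typical, 2 ** (len(x_seq) * (h_bx + delta)), 2 ** len(x_seq))
```

The count never exceeds $2^{n(H(B|X)+\delta)}$ because every typical string has conditional probability at least $2^{-n(H(B|X)+\delta)}$ and these probabilities sum to at most one.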

14.2.3 Strong Conditional Quantum Typicality 

We now develop the notion of strong conditional quantum typicality. This notion again applies to an ensemble or to a classical-quantum state such as that given in (14.49). It differs from weak conditional quantum typicality, though, because we can prove stronger statements about the asymptotic behavior of conditional quantum systems (just as we could for the classical case in Section 13.9). We begin this section with an example to build up our intuition. We then follow with the formal definition of strong conditional quantum typicality, and we end by proving some properties of the strong conditionally typical subspace.

Recall the example from Section 13.7. In a similar way to this example, we can draw a sequence from an alphabet $\{0, 1, 2\}$ according to the following distribution:

\begin{equation}
p_X(0) = \frac{1}{4}, \qquad p_X(1) = \frac{1}{4}, \qquad p_X(2) = \frac{1}{2}. \tag{14.69}
\end{equation}

One potential realization sequence is as follows: 

201020102212. (14.70) 

The above sequence has four "zeros," three "ones," and five "twos," so that the empirical distribution of this sequence is $(1/3, 1/4, 5/12)$ and has maximum deviation $1/12$ from the true distribution in (14.69).

For each symbol in the above sequence, we could then draw from one of three quantum information sources based on whether the classical index is 0, 1, or 2. Suppose that the expected density operator of the first quantum information source is $\rho_0$, that of the second is $\rho_1$, and that of the third is $\rho_2$. Then the density operator for the resulting sequence of quantum states is as follows:

\begin{equation}
\rho_2^{B_1} \otimes \rho_0^{B_2} \otimes \rho_1^{B_3} \otimes \rho_0^{B_4} \otimes \rho_2^{B_5} \otimes \rho_0^{B_6} \otimes \rho_1^{B_7} \otimes \rho_0^{B_8} \otimes \rho_2^{B_9} \otimes \rho_2^{B_{10}} \otimes \rho_1^{B_{11}} \otimes \rho_2^{B_{12}}, \tag{14.71}
\end{equation}

where the superscripts label the quantum systems in which the density operators live. So, the state of systems $B_1$, $B_5$, $B_9$, $B_{10}$, and $B_{12}$ is equal to five copies of $\rho_2$, the state of systems $B_2$, $B_4$, $B_6$, and $B_8$ is equal to four copies of $\rho_0$, and the state of systems $B_3$, $B_7$, and $B_{11}$ is equal to three copies of $\rho_1$. Let $I_x$ be an indicator set for each $x \in \{0, 1, 2\}$, so that $I_x$ consists of all the indices in the sequence for which a symbol is equal to $x$. For the above example,

\begin{equation}
I_0 = \{2, 4, 6, 8\}, \qquad I_1 = \{3, 7, 11\}, \qquad I_2 = \{1, 5, 9, 10, 12\}. \tag{14.72}
\end{equation}

These sets serve as a way of grouping all of the density operators that are the same because they correspond to the same classical symbol, and it is important to do so if we would like to consider concentration of measure effects when we go to the asymptotic setting. As a visual aid, we could permute the sequence of density operators in (14.71) if we would like to see systems with the same density operator grouped together:

\begin{equation}
\rho_0^{B_2} \otimes \rho_0^{B_4} \otimes \rho_0^{B_6} \otimes \rho_0^{B_8} \otimes \rho_1^{B_3} \otimes \rho_1^{B_7} \otimes \rho_1^{B_{11}} \otimes \rho_2^{B_1} \otimes \rho_2^{B_5} \otimes \rho_2^{B_9} \otimes \rho_2^{B_{10}} \otimes \rho_2^{B_{12}}. \tag{14.73}
\end{equation}

There is then a typical projector for the first four systems with density operator $\rho_0$, a different typical projector for the next three systems with density operator $\rho_1$, and yet another typical projector for the last five systems with density operator $\rho_2$ (though, the length of the above quantum sequence is certainly not large enough to observe any measure concentration effects!). Thus, the indicator sets $I_x$ serve to identify which systems have the same density operator so that we can know upon which systems a particular typical projector should act.
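The bookkeeping in this example is easy to reproduce. The short sketch below recomputes the indicator sets in (14.72) and the maximum deviation of the empirical distribution from (14.69) for the sequence in (14.70); the helper name `indicator_sets` is a hypothetical choice, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def indicator_sets(sequence):
    """Map each symbol x to the list of (1-based) positions i with x_i = x."""
    sets = {}
    for position, symbol in enumerate(sequence, start=1):
        sets.setdefault(symbol, []).append(position)
    return sets


sequence = [int(ch) for ch in "201020102212"]  # the realization in (14.70)
sets = indicator_sets(sequence)

# Empirical distribution N(x|x^n)/n and its deviation from the true distribution.
true_dist = {0: Fraction(1, 4), 1: Fraction(1, 4), 2: Fraction(1, 2)}
empirical = {x: Fraction(len(sets.get(x, [])), len(sequence)) for x in true_dist}
max_deviation = max(abs(empirical[x] - true_dist[x]) for x in true_dist)

print(sets)
print(empirical, max_deviation)
```

Each list `sets[x]` names the systems $B_i$ on which the typical projector for $\rho_x$ would act.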

This example helps build our intuition of strong conditional quantum typicality, and we 
can now begin to state what we would expect in the asymptotic setting. Suppose that the 
original classical sequence is large and strongly typical, so that it has roughly n/4 occurrences 
of "zero," n/4 occurrences of "one," and n/2 occurrences of "two." We would then expect 
the law of large numbers to come into play for n/4 and n/2 when n is large enough. Thus, 
we can use the classical sequence to identify which quantum systems have the same density 
operator, and apply a typical projector to each of these subsets of quantum systems. Then 
all of the useful asymptotic properties of typical subspaces apply whenever n is large enough. 

We can now state the definition of the strong conditionally typical subspace and the 
strong conditionally typical projector, and we prove some of their asymptotic properties by 
exploiting the properties of typical subspaces. 

Definition 14.2.3 (Strong Conditionally Typical Subspace). The strong conditionally typical subspace corresponds to a sequence $x^n$ and an ensemble $\{p_X(x), \rho_x^B\}$. Let the spectral decomposition of each state $\rho_x^B$ be as in (14.51) with distribution $p_{Y|X}(y|x)$ and corresponding eigenstates $|y_x\rangle$. The strong conditionally typical subspace $T_\delta^{B^n|x^n}$ is then as follows:

\begin{equation}
T_\delta^{B^n|x^n} \equiv \mathrm{span}\left\{ \bigotimes_{x \in \mathcal{X}} |y_x^{I_x}\rangle^{B^{I_x}} : \forall x, \; y^{I_x} \in T_\delta^{(Y|x)^{|I_x|}} \right\}, \tag{14.74}
\end{equation}

where $I_x \equiv \{i : x_i = x\}$ is an indicator set that selects the indices $i$ in the sequence $x^n$ for which the $i$th symbol $x_i$ is equal to $x \in \mathcal{X}$, $B^{I_x}$ selects the systems from $B^n$ where the classical sequence $x^n$ is equal to the symbol $x$, $|y_x^{I_x}\rangle$ is some string of states from the set $\{|y_x\rangle\}$, $y^{I_x}$ is a classical string corresponding to this string of states, $Y|x$ is a random variable with distribution $p_{Y|X}(y|x)$, and $|I_x|$ is the cardinality of the indicator set $I_x$.

Definition 14.2.4 (Strong Conditionally Typical Projector). The strong conditionally typical projector again corresponds to a sequence $x^n$ and an ensemble $\{p_X(x), \rho_x^B\}$. It is a tensor product of typical projectors for each state $\rho_x^B$ in the ensemble:

\begin{equation}
\Pi_\delta^{B^n|x^n} \equiv \bigotimes_{x \in \mathcal{X}} \Pi_{\rho_x, \delta}^{B^{I_x}}, \tag{14.75}
\end{equation}

where $I_x$ is defined in Definition 14.2.3, and $B^{I_x}$ indicates the systems onto which a particular typical projector for $\rho_x$ projects (see footnote 2).



14.2.4 Properties of the Strong Conditionally Typical Subspace 

The strong conditionally typical subspace admits several useful asymptotic properties similar to what we have seen before, and the strategy for proving all of them is similar to the way that we proved the analogous properties for the strong conditionally typical set in Section 13.9.3. Suppose that we draw a sequence $x^n$ from a probability distribution $p_X(x)$, and we are able to draw as many samples as we wish so that the sequence $x^n$ is strongly typical and the occurrences $N(x|x^n)$ of each symbol $x$ are as large as we wish. Then the following properties hold.

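Property 14.2.4 (Unit Probability) The probability that we measure a quantum state $\rho_{x^n}^{B^n}$ to be in the conditionally typical subspace $T_\delta^{B^n|x^n}$ approaches one as $n$ becomes large:

\begin{equation}
\forall \epsilon > 0 \quad \mathrm{Tr}\{\Pi_\delta^{B^n|x^n} \rho_{x^n}^{B^n}\} \geq 1 - \epsilon \quad \text{for sufficiently large } n. \tag{14.76}
\end{equation}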

Property 14.2.5 (Exponentially Small Dimension) The dimension $\dim(T_\delta^{B^n|x^n})$ of the $\delta$-conditionally typical subspace is exponentially smaller than the dimension $|B|^n$ of the entire space of quantum states for all classical-quantum information sources besides ones where all their density operators are maximally mixed. We formally state this property as follows:

\begin{equation}
\mathrm{Tr}\{\Pi_\delta^{B^n|x^n}\} \leq 2^{n(H(B|X) + \delta'')}. \tag{14.77}
\end{equation}

We can also lower bound the dimension $\dim(T_\delta^{B^n|x^n})$ of the $\delta$-conditionally typical subspace when $n$ is sufficiently large:

\begin{equation}
\forall \epsilon > 0 \quad \mathrm{Tr}\{\Pi_\delta^{B^n|x^n}\} \geq (1 - \epsilon) \, 2^{n(H(B|X) - \delta'')} \quad \text{for sufficiently large } n. \tag{14.78}
\end{equation}

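Property 14.2.6 (Equipartition) The state $\rho_{x^n}^{B^n}$ is approximately maximally mixed when projected onto the strong conditionally typical subspace:

\begin{equation}
2^{-n(H(B|X) + \delta'')} \, \Pi_\delta^{B^n|x^n} \leq \Pi_\delta^{B^n|x^n} \rho_{x^n}^{B^n} \Pi_\delta^{B^n|x^n} \leq 2^{-n(H(B|X) - \delta'')} \, \Pi_\delta^{B^n|x^n}. \tag{14.79}
\end{equation}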



2 Having the conditional density operators in the subscript breaks somewhat from our convention throughout this chapter, but it is useful here to indicate explicitly which density operator corresponds to a typical projector.

14.2.5 Proofs of the Properties of the Strong Conditionally Typical Subspace



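Proof of the Unit Probability Property (Property 14.2.4). The proof of this property is similar to the proof of Property 13.9.1 for the strong conditionally typical set. Since we are dealing with an IID distribution, we can assume without loss of generality that the sequence $x^n$ is lexicographically ordered with an order on the alphabet $\mathcal{X}$. We write the elements of $\mathcal{X}$ as $a_1, \ldots, a_{|\mathcal{X}|}$. Then the lexicographic ordering means that we can write the sequence of quantum states $\rho_{x^n}$ as follows:

\begin{equation}
\rho_{x^n} = \underbrace{\rho_{a_1} \otimes \cdots \otimes \rho_{a_1}}_{N(a_1|x^n)} \otimes \underbrace{\rho_{a_2} \otimes \cdots \otimes \rho_{a_2}}_{N(a_2|x^n)} \otimes \cdots \otimes \underbrace{\rho_{a_{|\mathcal{X}|}} \otimes \cdots \otimes \rho_{a_{|\mathcal{X}|}}}_{N(a_{|\mathcal{X}|}|x^n)}. \tag{14.80}
\end{equation}

It follows that $N(a_i|x^n) \geq n(p_X(a_i) - \delta')$ from the typicality of $x^n$, and thus the law of large numbers comes into play for each block $a_i \cdots a_i$ with length $N(a_i|x^n)$. The strong conditionally typical projector $\Pi_\delta^{B^n|x^n}$ for this system is as follows:

\begin{equation}
\Pi_\delta^{B^n|x^n} \equiv \bigotimes_{x \in \mathcal{X}} \Pi_{\rho_x, \delta}^{B^{N(x|x^n)}}, \tag{14.81}
\end{equation}

because we assumed the lexicographic ordering of the symbols in the sequence $x^n$. Each projector $\Pi_{\rho_x, \delta}^{B^{N(x|x^n)}}$ in the above tensor product is a typical projector for the density operator $\rho_x^{\otimes N(x|x^n)}$ when $N(x|x^n) \approx n p_X(x)$ becomes very large. Then we can apply the Unit Probability Property (Property 14.1.1) for each of these typical projectors, and it follows that

\begin{align}
\mathrm{Tr}\{\rho_{x^n}^{B^n} \Pi_\delta^{B^n|x^n}\} &= \mathrm{Tr}\left\{ \bigotimes_{x \in \mathcal{X}} \Pi_{\rho_x, \delta}^{B^{N(x|x^n)}} \rho_x^{\otimes N(x|x^n)} \right\} \tag{14.82} \\
&= \prod_{x \in \mathcal{X}} \mathrm{Tr}\left\{ \Pi_{\rho_x, \delta}^{B^{N(x|x^n)}} \rho_x^{\otimes N(x|x^n)} \right\} \tag{14.83} \\
&\geq (1 - \epsilon)^{|\mathcal{X}|} \tag{14.84} \\
&\geq 1 - |\mathcal{X}| \epsilon. \tag{14.85}
\end{align}

The second equality holds because the trace of a tensor product is the product of the traces, and the last inequality is Bernoulli's inequality. □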



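Proof of the Equipartition Property (Property 14.2.6). We first assume without loss of generality that we can write the state $\rho_{x^n}$ in lexicographic order as in (14.80). Then the strong conditionally typical projector is again as in (14.81). It follows that

\begin{equation}
\Pi_\delta^{B^n|x^n} \rho_{x^n} \Pi_\delta^{B^n|x^n} = \bigotimes_{x \in \mathcal{X}} \Pi_{\rho_x, \delta}^{B^{N(x|x^n)}} \rho_x^{\otimes N(x|x^n)} \Pi_{\rho_x, \delta}^{B^{N(x|x^n)}}. \tag{14.86}
\end{equation}

We can apply the Equipartition Property of the typical subspace for each typical projector $\Pi_{\rho_x, \delta}^{B^{N(x|x^n)}}$ (Property 14.1.3):

\begin{equation}
\bigotimes_{x \in \mathcal{X}} \Pi_{\rho_x, \delta}^{B^{N(x|x^n)}} \, 2^{-N(x|x^n)(H(\rho_x) + c\delta)} \leq \Pi_\delta^{B^n|x^n} \rho_{x^n} \Pi_\delta^{B^n|x^n} \leq \bigotimes_{x \in \mathcal{X}} \Pi_{\rho_x, \delta}^{B^{N(x|x^n)}} \, 2^{-N(x|x^n)(H(\rho_x) - c\delta)}. \tag{14.87}
\end{equation}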

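The following inequalities hold because the sequence $x^n$ is strongly typical as defined in Definition 13.7.2:

\begin{equation}
\bigotimes_{x \in \mathcal{X}} \Pi_{\rho_x, \delta}^{B^{N(x|x^n)}} \, 2^{-n(p_X(x) + \delta')(H(\rho_x) + c\delta)} \leq \Pi_\delta^{B^n|x^n} \rho_{x^n} \Pi_\delta^{B^n|x^n} \leq \bigotimes_{x \in \mathcal{X}} \Pi_{\rho_x, \delta}^{B^{N(x|x^n)}} \, 2^{-n(p_X(x) - \delta')(H(\rho_x) - c\delta)}. \tag{14.88}
\end{equation}

We can factor out each term $2^{-n(p_X(x) + \delta')(H(\rho_x) + c\delta)}$ from the tensor products:

\begin{equation}
\prod_{x \in \mathcal{X}} 2^{-n(p_X(x) + \delta')(H(\rho_x) + c\delta)} \bigotimes_{x \in \mathcal{X}} \Pi_{\rho_x, \delta}^{B^{N(x|x^n)}} \leq \Pi_\delta^{B^n|x^n} \rho_{x^n} \Pi_\delta^{B^n|x^n} \leq \prod_{x \in \mathcal{X}} 2^{-n(p_X(x) - \delta')(H(\rho_x) - c\delta)} \bigotimes_{x \in \mathcal{X}} \Pi_{\rho_x, \delta}^{B^{N(x|x^n)}}. \tag{14.89}
\end{equation}

We then multiply out the $|\mathcal{X}|$ terms $2^{-n(p_X(x) + \delta')(H(\rho_x) + c\delta)}$:

\begin{equation}
2^{-n\left(H(B|X) + \sum_x \left(H(\rho_x)\delta' + c\, p_X(x)\delta + c\delta\delta'\right)\right)} \, \Pi_\delta^{B^n|x^n} \leq \Pi_\delta^{B^n|x^n} \rho_{x^n} \Pi_\delta^{B^n|x^n} \leq 2^{-n\left(H(B|X) + \sum_x \left(c\delta\delta' - H(\rho_x)\delta' - c\, p_X(x)\delta\right)\right)} \, \Pi_\delta^{B^n|x^n}. \tag{14.90}
\end{equation}

The final step below follows because $\sum_x p_X(x) = 1$ and because the bound $\sum_x H(\rho_x) \leq |\mathcal{X}| \log d$ applies where $d$ is the dimension of the density operator $\rho_x$:

\begin{equation}
2^{-n(H(B|X) + \delta'')} \, \Pi_\delta^{B^n|x^n} \leq \Pi_\delta^{B^n|x^n} \rho_{x^n} \Pi_\delta^{B^n|x^n} \leq 2^{-n(H(B|X) - \delta'')} \, \Pi_\delta^{B^n|x^n}, \tag{14.91}
\end{equation}

where

\begin{equation}
\delta'' \equiv \delta' |\mathcal{X}| \log d + c\delta + |\mathcal{X}| c \delta \delta'. \tag{14.92}
\end{equation}

□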
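The exponent arithmetic that takes (14.89) to (14.91) can be checked numerically. The sketch below is only an illustration: it expands the products $(p_X(x) \pm \delta')(H(\rho_x) \pm c\delta)$ for a hypothetical qubit ensemble and compares the resulting exponents against $H(B|X) \pm \delta''$ with $\delta''$ as in (14.92). All numerical values, the constant `c`, and the function name are assumptions.

```python
import math


def equipartition_exponents(p_x, entropies, c, delta, delta_prime, d):
    """Return (H(B|X), lower exponent, upper exponent, delta'').

    The lower bound in (14.89) carries 2^{-n * lower_exponent} and the upper
    bound carries 2^{-n * upper_exponent}.
    """
    h_bx = sum(p * h for p, h in zip(p_x, entropies))
    lower_exponent = sum(
        (p + delta_prime) * (h + c * delta) for p, h in zip(p_x, entropies)
    )
    upper_exponent = sum(
        (p - delta_prime) * (h - c * delta) for p, h in zip(p_x, entropies)
    )
    size = len(p_x)
    delta_pp = delta_prime * size * math.log2(d) + c * delta + size * c * delta * delta_prime
    return h_bx, lower_exponent, upper_exponent, delta_pp


# Hypothetical values: three qubit states (d = 2) with entropies at most log2(d) = 1.
p_x = [0.25, 0.25, 0.5]
entropies = [0.3, 0.8, 1.0]
h_bx, lower_exponent, upper_exponent, delta_pp = equipartition_exponents(
    p_x, entropies, c=1.0, delta=0.01, delta_prime=0.02, d=2
)
print(h_bx, lower_exponent, upper_exponent, delta_pp)
```

The lower-bound exponent should not exceed $H(B|X) + \delta''$ and the upper-bound exponent should not fall below $H(B|X) - \delta''$, which is exactly what is needed to pass from (14.90) to (14.91).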



Exercise 14.2.2 Prove Property 14.2.5.

14.2.6 Strong Conditional and Marginal Quantum Typicality

We end this section on strong conditional quantum typicality by proving a final property 
that applies to a state drawn from an ensemble and the typical subspace of the expected 
density operator of the ensemble. 



Property 14.2.7 Consider an ensemble of the form $\{p_X(x), \rho_x\}$ with expected density operator $\rho \equiv \sum_x p_X(x) \rho_x$. Suppose that $x^n$ is a strongly typical sequence with respect to the distribution $p_X(x)$ and leads to a conditional density operator $\rho_{x^n}$. Then the probability of measuring $\rho_{x^n}$ in the strongly typical subspace of $\rho$ is high:

\begin{equation}
\forall \epsilon > 0 \quad \mathrm{Tr}\{\Pi_{\rho, \delta}^{n} \rho_{x^n}\} \geq 1 - \epsilon \quad \text{for sufficiently large } n, \tag{14.93}
\end{equation}

where the typical projector $\Pi_{\rho, \delta}^{n}$ is with respect to the density operator $\rho$.

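Proof. Let the expected density operator have the following spectral decomposition:

\begin{equation}
\rho = \sum_z p_Z(z) \, |z\rangle\langle z|. \tag{14.94}
\end{equation}

We define the "pinching" operation as a dephasing with respect to the basis $\{|z\rangle\}$:

\begin{equation}
\sigma \rightarrow \Delta(\sigma) \equiv \sum_z |z\rangle\langle z| \sigma |z\rangle\langle z|. \tag{14.95}
\end{equation}

Let $\bar{\rho}_x$ denote the pinched version of the conditional density operators $\rho_x$:

\begin{equation}
\bar{\rho}_x \equiv \Delta(\rho_x) = \sum_z |z\rangle\langle z| \rho_x |z\rangle\langle z| = \sum_z p_{Z|X}(z|x) \, |z\rangle\langle z|, \tag{14.96}
\end{equation}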



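where $p_{Z|X}(z|x) \equiv \langle z| \rho_x |z\rangle$. This pinching is the crucial insight for the proof because all of the pinched density operators $\bar{\rho}_x$ have a common eigenbasis and the analysis reduces from a quantum one to a classical one that exploits the properties of strong marginal, conditional, and joint typicality. The following chain of equalities then holds by exploiting the above definitions:

\begin{align}
\mathrm{Tr}\{\rho_{x^n} \Pi_{\rho, \delta}^{n}\} &= \mathrm{Tr}\left\{ \rho_{x^n} \sum_{z^n \in T_\delta^{Z^n}} |z^n\rangle\langle z^n| \right\} \tag{14.97} \\
&= \mathrm{Tr}\left\{ \rho_{x^n} \sum_{z^n \in T_\delta^{Z^n}} |z^n\rangle\langle z^n| \, |z^n\rangle\langle z^n| \right\} \tag{14.98} \\
&= \mathrm{Tr}\left\{ \sum_{z^n \in T_\delta^{Z^n}} |z^n\rangle\langle z^n| \rho_{x^n} |z^n\rangle\langle z^n| \right\} \tag{14.99} \\
&= \mathrm{Tr}\left\{ \sum_{z^n \in T_\delta^{Z^n}} p_{Z^n|X^n}(z^n|x^n) \, |z^n\rangle\langle z^n| \right\} \tag{14.100} \\
&= \sum_{z^n \in T_\delta^{Z^n}} p_{Z^n|X^n}(z^n|x^n). \tag{14.101}
\end{align}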

The first equality follows from the definition of the typical projector $\Pi_{\rho, \delta}^{n}$. The second equality follows because $|z^n\rangle\langle z^n|$ is a projector, and the third follows from linearity and cyclicity of the trace. The fourth equality follows because

\begin{equation}
\langle z^n| \rho_{x^n} |z^n\rangle = \prod_{i=1}^{n} \langle z_i| \rho_{x_i} |z_i\rangle = \prod_{i=1}^{n} p_{Z|X}(z_i|x_i) \equiv p_{Z^n|X^n}(z^n|x^n). \tag{14.102}
\end{equation}

Now consider this final expression $\sum_{z^n \in T_\delta^{Z^n}} p_{Z^n|X^n}(z^n|x^n)$. It is equivalent to the probability that a random conditional sequence $Z^n|x^n$ is in the typical set for $p_Z(z)$:

\begin{equation}
\Pr\{Z^n|x^n \in T_\delta^{Z^n}\}. \tag{14.103}
\end{equation}

By taking $n$ large enough, the law of large numbers guarantees that it is highly likely (with probability greater than $1 - \epsilon$ for any $\epsilon > 0$) that this random conditional sequence $Z^n|x^n$ is in the conditionally typical set $T_{\delta'}^{Z^n|x^n}$, for some $\delta'$. It then follows that this conditional sequence has a high probability of being in the unconditionally typical set $T_\delta^{Z^n}$ because we assumed that the sequence $x^n$ is strongly typical and Lemma 13.9.1 states that a sequence $z^n$ is unconditionally typical if $x^n$ is strongly typical and $z^n$ is strong conditionally typical. □
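The pinching step used in the proof above can be illustrated with a small numerical sketch. The two qubit density operators below are hypothetical and are chosen so that the expected density operator is diagonal in the computational basis, which then plays the role of the eigenbasis $\{|z\rangle\}$; the function name `pinch` is also an assumption.

```python
def pinch(sigma):
    """Dephase a square matrix (list of rows) in the computational basis.

    This implements Delta(sigma) = sum_z |z><z| sigma |z><z| when {|z>} is the
    computational basis: only the diagonal entries survive.
    """
    dim = len(sigma)
    return [[sigma[i][j] if i == j else 0.0 for j in range(dim)] for i in range(dim)]


# Hypothetical ensemble of two qubit states with equal probabilities.
p_x = [0.5, 0.5]
rho_0 = [[0.75, 0.25], [0.25, 0.25]]
rho_1 = [[0.75, -0.25], [-0.25, 0.25]]
states = [rho_0, rho_1]

# Expected density operator rho = sum_x p_X(x) rho_x (diagonal by construction).
rho = [
    [sum(p * s[i][j] for p, s in zip(p_x, states)) for j in range(2)]
    for i in range(2)
]
p_z = [rho[z][z] for z in range(2)]  # eigenvalues p_Z(z) of rho

# Conditional distributions p_{Z|X}(z|x) = <z| rho_x |z> from the pinched states.
pinched = [pinch(s) for s in states]
p_z_given_x = [[pinched[x][z][z] for z in range(2)] for x in range(2)]

# Averaging p_{Z|X} over p_X recovers p_Z, as in the classical analysis.
averaged = [sum(p_x[x] * p_z_given_x[x][z] for x in range(2)) for z in range(2)]
print(p_z, p_z_given_x, averaged)
```

After pinching, every conditional state is diagonal in the same basis, so the remaining argument involves only the classical distributions $p_X$, $p_{Z|X}$, and $p_Z$.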

14.3 The Method of Types for Quantum Systems 

Our final development in this chapter is to establish the method of types in the quantum domain, and the classical tools from Section 13.7 have a straightforward generalization.

We can partition the Hilbert space of $n$ qudits into different type class subspaces, just as we can partition the set of all sequences into different type classes. For example, consider the Hilbert space of three qubits. The computational basis is an orthonormal basis for the entire Hilbert space of three qubits:

\begin{equation}
\{|000\rangle, |001\rangle, |010\rangle, |011\rangle, |100\rangle, |101\rangle, |110\rangle, |111\rangle\}. \tag{14.104}
\end{equation}

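Then the computational basis states with the same Hamming weight form a basis for each type class subspace. So, for the above example, the type class subspaces are as follows:

\begin{align}
T_0 &\equiv \mathrm{span}\{|000\rangle\}, \tag{14.105} \\
T_1 &\equiv \mathrm{span}\{|001\rangle, |010\rangle, |100\rangle\}, \tag{14.106} \\
T_2 &\equiv \mathrm{span}\{|011\rangle, |101\rangle, |110\rangle\}, \tag{14.107} \\
T_3 &\equiv \mathrm{span}\{|111\rangle\}, \tag{14.108}
\end{align}

and the projectors onto the different type class subspaces are as follows:

\begin{align}
\Pi_0 &\equiv |000\rangle\langle 000|, \tag{14.109} \\
\Pi_1 &\equiv |001\rangle\langle 001| + |010\rangle\langle 010| + |100\rangle\langle 100|, \tag{14.110} \\
\Pi_2 &\equiv |011\rangle\langle 011| + |101\rangle\langle 101| + |110\rangle\langle 110|, \tag{14.111} \\
\Pi_3 &\equiv |111\rangle\langle 111|. \tag{14.112}
\end{align}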
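The grouping in this three-qubit example can be reproduced by sorting computational basis labels by Hamming weight. The following minimal sketch does so for $n$ qubits; the function name `type_classes` is a hypothetical choice, and each basis state is represented only by its bit-string label.

```python
from itertools import product


def type_classes(n):
    """Group the n-qubit computational basis labels by Hamming weight."""
    classes = {}
    for bits in product("01", repeat=n):
        label = "".join(bits)
        classes.setdefault(label.count("1"), []).append(label)
    return classes


classes = type_classes(3)
dimensions = {weight: len(labels) for weight, labels in classes.items()}
print(classes)
print(dimensions, sum(dimensions.values()))
```

The dimensions of the type class subspaces add up to $2^n$, which reflects the fact that the type class projectors resolve the identity.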

We can generalize the above example to an n-fold tensor product of qudit systems with 
the method of types. 

Definition 14.3.1 (Type Class Subspace). The type class subspace is the subspace spanned by all states with the same type:

\begin{equation}
T_t^{X^n} \equiv \mathrm{span}\{|x^n\rangle : x^n \in T_t^{X^n}\}, \tag{14.113}
\end{equation}

where the notation $T_t^{X^n}$ on the LHS indicates the type class subspace, and the notation $T_t^{X^n}$ on the RHS indicates the type class of the classical sequence $x^n$.

Definition 14.3.2 (Type Class Projector). Let $\Pi_t^{X^n}$ denote the type class subspace projector:

\begin{equation}
\Pi_t^{X^n} \equiv \sum_{x^n \in T_t^{X^n}} |x^n\rangle\langle x^n|. \tag{14.114}
\end{equation}

Property 14.3.1 (Resolution of the Identity with Type Class Projectors) The sum of all type class projectors forms a resolution of the identity on the full Hilbert space $\mathcal{H}^{\otimes n}$ of $n$ qudits:

\begin{equation}
I = \sum_t \Pi_t^{X^n}, \tag{14.115}
\end{equation}

where $I$ is the identity operator on $\mathcal{H}^{\otimes n}$.

Definition 14.3.3 (Maximally Mixed Type Class State). The maximally mixed density operator proportional to the type class subspace projector is

\begin{equation}
\pi_t \equiv D_t^{-1} \, \Pi_t^{X^n}, \tag{14.116}
\end{equation}

where $D_t$ is the dimension of the type class:

\begin{equation}
D_t \equiv \mathrm{Tr}\{\Pi_t^{X^n}\}. \tag{14.117}
\end{equation}
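Since the type class projector is a sum of orthogonal rank-one projectors, one for each classical sequence in the type class, the dimension $D_t$ equals the size of the classical type class, which is a multinomial coefficient. The sketch below computes it and compares it with the standard method-of-types bounds $(n+1)^{-d} \, 2^{nH(t)} \leq D_t \leq 2^{nH(t)}$, where $d$ is the alphabet size. The function names are hypothetical, and the counts $(4, 3, 5)$ are borrowed from the length-12 sequence of Section 14.2.3 purely as an example.

```python
import math


def type_class_dimension(counts):
    """Multinomial coefficient n! / (k_1! ... k_d!) for symbol counts k_1, ..., k_d."""
    dimension = math.factorial(sum(counts))
    for k in counts:
        dimension //= math.factorial(k)
    return dimension


def type_entropy(counts):
    """Entropy in bits of the empirical distribution (type) k_i / n."""
    n = sum(counts)
    return -sum((k / n) * math.log2(k / n) for k in counts if k > 0)


counts = (4, 3, 5)  # four "zeros," three "ones," and five "twos"
n, d = sum(counts), len(counts)
dim_t = type_class_dimension(counts)
upper = 2 ** (n * type_entropy(counts))
lower = upper / (n + 1) ** d
print(lower, dim_t, upper)
```

The polynomial factor $(n+1)^{-d}$ is the source of the $d \frac{1}{n} \log(n+1)$ correction that appears in dimension lower bounds for typical type classes.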



Recall from Definition 13.7.4 that a $\delta$-typical type is one for which the empirical distribution has maximum deviation $\delta$ from the true distribution, and $\tau_\delta$ is the set of all $\delta$-typical types. For the quantum case, we determine the maximum deviation $\delta$ of a type from the true distribution $p_X(x)$ (this is the distribution from the spectral decomposition of a density operator $\rho$). This definition allows us to write the strongly $\delta$-typical subspace projector $\Pi_\delta^{X^n}$ of $\rho$ as a sum over all of the $\delta$-typical type class projectors $\Pi_t^{X^n}$:

\begin{equation}
\Pi_\delta^{X^n} = \sum_{t \in \tau_\delta} \Pi_t^{X^n}. \tag{14.118}
\end{equation}

Some protocols in quantum Shannon theory such as entanglement concentration in Chapter 18 employ the above decomposition of the typical subspace projector into types. The way that such a protocol works is first to perform a typical subspace measurement on many copies of a state, and this measurement succeeds with high probability. One party involved in the protocol then performs a type class measurement $\{\Pi_t^{X^n}\}_t$. We perform this latter measurement in a protocol if we would like the state to have a uniform distribution over states in the type class. One might initially think that the dimension of the remaining state would not be particularly large, but it actually holds that the dimension is large because we can obtain the following useful lower bound on the dimension of any typical type class projector.

Property 14.3.2 (Minimal Dimension of a Typical Type Class Projector) Suppose that $p_X(x)$ is the distribution from the spectral decomposition of a density operator $\rho$, and $\tau_\delta$ collects all the type class subspaces with maximum deviation $\delta$ from the distribution $p_X(x)$. Then for any type $t \in \tau_\delta$ and for sufficiently large $n$, we can lower bound the dimension of the type class projector $\Pi_t^{X^n}$ as follows:

\begin{equation}
\mathrm{Tr}\{\Pi_t^{X^n}\} \geq 2^{n\left[H(\rho) - \eta(d\delta) - d \frac{1}{n} \log(n+1)\right]}, \tag{14.119}
\end{equation}

where $d$ is the dimension of the Hilbert space where $\rho$ lives and the function $\eta(d\delta) \rightarrow 0$ as $\delta \rightarrow 0$.

Proof. The proof follows directly by exploiting Property 13.7.5 from the previous chapter. □

14.4 Concluding Remarks

This chapter is about the asymptotic nature of quantum information in the IID setting. 
The main technical development is the notion of the typical subspace, and our approach 
here is simply to "quantize" the definition of the typical set from the previous chapter. The 
typical subspace enjoys properties similar to those of the typical set — the probability that 
many copies of a density operator lie in the typical subspace approaches one as the number 
of copies approaches infinity, the dimension of the typical subspace is exponentially smaller 
than the dimension of the full Hilbert space, and many copies of a density operator look 
approximately maximally mixed on the typical subspace. The rest of the content in this 
chapter involves an extension of these ideas to conditional quantum typicality. 

The content in this chapter is here to provide a rigorous underpinning that we can quickly 
cite later on, and after having mastered the results in this chapter along with the tools in 
the next two chapters, we will be ready to prove many of the important results in quantum 
Shannon theory. 

14.5 History and Further Reading 

Ohya and Petz devised the notion of a typical subspace [200], and later Schumacher independently devised it when he proved the quantum data compression theorem bearing his name [216]. Holevo [144], Schumacher, and Westmoreland [219] introduced the conditionally typical subspace in order to prove the HSW coding theorem. Winter's thesis is a good source for proofs of several properties of quantum typicality [256]. The book of Nielsen and Chuang uses weak conditional quantum typicality to prove the HSW theorem [197]. Bennett et al. [34] and Holevo [146] introduced frequency-typical (or strongly-typical) subspaces to quantum information theory in order to prove the entanglement-assisted classical capacity theorem. Devetak used strong typicality to prove the HSW coding theorem in Appendix B of Ref.

The Packing Lemma



The Packing Lemma is a general method for one party to pack classical messages into a Hilbert space so that another party can distinguish the packed messages. The first party has access to an ensemble of quantum states, and the other party has access to a set of projectors from which he can form a quantum measurement. If the ensemble and the projectors satisfy the conditions of the Packing Lemma, then it guarantees the existence of a scheme by which the second party can distinguish the classical messages that the first party prepares.

The statement of the Packing Lemma is quite general, and this approach has a great advantage because we can use it as a primitive in many coding theorems in quantum Shannon theory. Examples of coding theorems that we can prove with the Packing Lemma are the Holevo-Schumacher-Westmoreland (HSW) theorem for transmission of classical information over a quantum channel and the entanglement-assisted classical capacity theorem for the transmission of classical information over an entanglement-assisted quantum channel (furthermore, Chapter 21 shows that these two protocols are sufficient to generate most known protocols in quantum Shannon theory). Combined with the Covering Lemma of the next chapter, the Packing Lemma gives a method for transmitting private classical information over a quantum channel, and this technique in turn gives a way to communicate quantum information over a quantum channel. As long as we can determine an ensemble and a set of projectors satisfying the conditions of the Packing Lemma, we can apply it in a straightforward way. For example, we prove the HSW coding theorem in Chapter 19 largely by relying on the properties of typical and conditionally typical subspaces that we proved in the previous chapter, and some of these properties are equivalent to the conditions of the Packing Lemma.

The Packing Lemma is a "one-shot" lemma because it applies to a general scenario that is not limited only to IID uses of a quantum channel. This "one-shot" approach is part of the reason that we can apply it in so many different situations. The technique of proving a "one-shot" result and applying it to the IID scenario is a common method of attack in quantum Shannon theory (we do it again in Chapter 16 by proving the Covering Lemma that helps in determining a way to send private classical information over a noisy quantum channel).

We begin in the next section with a simple example that illustrates the main ideas of the Packing Lemma. We then generalize this setting and give the statement of the Packing Lemma. We dissect its proof in several sections that explain the random selection of a code, the construction of a quantum measurement, and the error analysis. We finally show how to derandomize the Packing Lemma so that there exists some scheme for packing classical messages into Hilbert space with negligible probability of error for determining each classical message.



15.1 Introductory Example 



Suppose that Alice would like to communicate classical information to Bob, and suppose 
further that she can prepare a message for Bob using the following BB84 ensemble: 

$$\{|0\rangle, |1\rangle, |+\rangle, |-\rangle\}, \tag{15.1}$$

where each state occurs with equal probability. Let us label each of the above states by
the classical indices a, b, c, and d so that a labels $|0\rangle$, b labels $|1\rangle$, etc. She cannot use
all of the states for transmitting classical information because, for example, $|0\rangle$ and $|+\rangle$ are
non-orthogonal states and there is no measurement that can distinguish them with high
probability.

How can Alice communicate to Bob using this ensemble? She can choose a subset of
the states in the BB84 ensemble for transmitting classical information. She can choose the
states $|0\rangle$ and $|1\rangle$ for encoding one classical bit of information. Bob can then perform a von
Neumann measurement in the basis $\{|0\rangle, |1\rangle\}$ to determine the message that Alice encodes.
Alternatively, Alice and Bob can use the states $|+\rangle$ and $|-\rangle$ in a similar fashion for encoding
one classical bit of information.

In the above example, Alice can send two messages by using the labels a and b only. We
say that the labels a and b constitute the code. The states $|0\rangle$ and $|1\rangle$ are the codewords, the
projectors $|0\rangle\langle 0|$ and $|1\rangle\langle 1|$ are each a codeword projector, and the projector $|0\rangle\langle 0| + |1\rangle\langle 1|$ is
the code projector (in this case, the code projector projects onto the whole Hilbert space).
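
The statements in this example can be checked numerically. The following is a minimal sketch that represents the four BB84 states as real two-component vectors (a representation chosen here for illustration; the helper names `overlap`, `projector`, and `add` are hypothetical). It computes the squared overlaps between codeword candidates and forms the code projector for the code {a, b}.

```python
import math

# BB84 states as real 2-vectors in the computational basis.
s = 1 / math.sqrt(2)
states = {
    "a": (1.0, 0.0),   # |0>
    "b": (0.0, 1.0),   # |1>
    "c": (s, s),       # |+>
    "d": (s, -s),      # |->
}

def overlap(u, v):
    """Squared inner product |<u|v>|^2 for real vectors."""
    return sum(x * y for x, y in zip(u, v)) ** 2

def projector(v):
    """Rank-one projector |v><v| as a 2x2 nested list."""
    return [[x * y for y in v] for x in v]

def add(p, q):
    """Entrywise sum of two 2x2 matrices."""
    return [[p[i][j] + q[i][j] for j in range(2)] for i in range(2)]

# Codewords a and b are orthogonal, so a measurement in {|0>, |1>} separates them.
print("overlap(a, b) =", overlap(states["a"], states["b"]))
# |0> and |+> have nonzero overlap, so they cannot be distinguished perfectly.
print("overlap(a, c) =", overlap(states["a"], states["c"]))
# The code projector |0><0| + |1><1| is the identity on the qubit.
code_projector = add(projector(states["a"]), projector(states["b"]))
print("code projector =", code_projector)
```

The same check with the labels c and d gives the alternative code built from $|+\rangle$ and $|-\rangle$.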

The construction in the above example gives a way to use a certain ensemble for "packing" 
classical information into Hilbert space, but there is only so much room for packing. For 
example, it is impossible to encode more than one bit of classical information into a qubit 
such that someone else can access this classical information reliably; this is the statement
of the Holevo bound (Exercise 11.9.1).



15.2 The Setting of the Packing Lemma 

We generalize the above example to show how Alice can efficiently pack classical information
into a Hilbert space such that Bob can retrieve it with high probability. Suppose that Alice's
resource for communication is an ensemble $\{p_X(x), \sigma_x\}_{x \in \mathcal{X}}$ of quantum states that she can
prepare for Bob, where the states $\sigma_x$ are not necessarily perfectly distinguishable. We define
the ensemble as follows:

Definition 15.2.1 (Ensemble). Suppose $\mathcal{X}$ is a set of size $|\mathcal{X}|$ with elements x, and suppose
X is a random variable with probability density function $p_X(x)$. Suppose we have an ensemble
$\{p_X(x), \sigma_x\}_{x \in \mathcal{X}}$ of quantum states where we encode each realization x into a quantum state
$\sigma_x$. The expected density operator of the ensemble is

$$\sigma \equiv \sum_{x \in \mathcal{X}} p_X(x) \sigma_x. \tag{15.2}$$

How can Alice transmit classical information reliably to Bob by making use of this
ensemble? As suggested in the example from the previous section, Alice can select a subset
of messages from the set $\mathcal{X}$, and Bob's task is to distinguish this subset of states as best
he can. We equip him with certain tools: a code subspace projector $\Pi$ and a set of
codeword subspace projectors $\{\Pi_x\}_{x \in \mathcal{X}}$ with certain desirable properties (we explain these terms
in more detail below). As a rough description, he can use these projectors to construct a
quantum measurement that determines the message Alice sends. He would like to be almost
certain that the received state lies in the subspace onto which the code subspace projector $\Pi$
projects. He would also like to use the codeword subspace projectors $\{\Pi_x\}_{x \in \mathcal{X}}$ to determine
the classical message that Alice sends. If the ensemble and the projectors satisfy certain
conditions, the four conditions of the Packing Lemma, then it is possible for Bob to build
up a measurement such that Alice can communicate reliably with him.

Suppose that Alice chooses some subset $\mathcal{C}$ of $\mathcal{X}$ for encoding classical information. The
subset $\mathcal{C}$ that Alice chooses constitutes a code. Let us index the code $\mathcal{C}$ by a message set $\mathcal{M}$
with elements m. The set $\mathcal{M}$ contains messages m that Alice would like to transmit to Bob,
and we assume that she chooses each message m with equal probability. The subensemble
that Alice uses for transmitting classical information is thus as follows:

$$\left\{ \frac{1}{|\mathcal{M}|}, \sigma_{c_m} \right\}_{m \in \mathcal{M}}, \tag{15.3}$$

where each $c_m$ is a codeword that depends on the message m and takes a value in $\mathcal{X}$.

Bob needs a way to determine the classical message that Alice transmits. The most
general way that quantum mechanics offers for retrieving classical information is a POVM.
Thus, Bob performs some measurement described by a POVM $\{\Lambda_m\}_{m \in \mathcal{M}}$. Bob constructs
this POVM by using the codeword subspace projectors $\{\Pi_x\}_{x \in \mathcal{X}}$ and the code subspace
projector $\Pi$ (we give the explicit construction in the proof of the Packing Lemma). If Alice
transmits a message m, the probability that Bob correctly retrieves the message m is as
follows:

$$\mathrm{Tr}\{\Lambda_m \sigma_{c_m}\}. \tag{15.4}$$

Thus, the probability of error for a given message m while using the code $\mathcal{C}$ is as follows:

\begin{align}
p_e(m, \mathcal{C}) &\equiv 1 - \mathrm{Tr}\{\Lambda_m \sigma_{c_m}\} \tag{15.5} \\
&= \mathrm{Tr}\{(I - \Lambda_m) \sigma_{c_m}\}. \tag{15.6}
\end{align}

We are interested in the performance of the code $\mathcal{C}$ that Alice and Bob choose, and we
consider three different measures of performance.

1. The first and strongest measure of performance is the maximal probability of error of
the code $\mathcal{C}$. A code $\mathcal{C}$ has maximal probability of error $\epsilon$ if the following criterion
holds:

$$\epsilon = \max_m p_e(m, \mathcal{C}). \tag{15.7}$$

2. A weaker measure of performance is the average probability of error $\bar{p}_e(\mathcal{C})$ of the code
$\mathcal{C}$, where

$$\bar{p}_e(\mathcal{C}) \equiv \frac{1}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} p_e(m, \mathcal{C}). \tag{15.8}$$

3. The third measure of performance is even weaker than the previous two but turns out
to be the most useful in the mathematical proofs. It uses a conceptually different notion
of code called a random code. Suppose that Alice and Bob choose a code $\mathcal{C}$ randomly
from the set of all possible codes according to some probability density $p_{\mathcal{C}}$ (the code $\mathcal{C}$
itself therefore becomes a random variable!). The third measure of performance is the
expectation of the average probability of error of a random code $\mathcal{C}$, where the expectation
is with respect to the set of all possible codes with message set $\mathcal{M}$ chosen according
to the density $p_{\mathcal{C}}$:

\begin{align}
\mathbb{E}_{\mathcal{C}}\{\bar{p}_e(\mathcal{C})\} &\equiv \mathbb{E}_{\mathcal{C}}\left\{ \frac{1}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} p_e(m, \mathcal{C}) \right\} \tag{15.9} \\
&= \sum_{\mathcal{C}} p_{\mathcal{C}}(\mathcal{C}) \left( \frac{1}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} p_e(m, \mathcal{C}) \right). \tag{15.10}
\end{align}

We will see that considering this performance criterion simplifies the mathematics
in the proof of the Packing Lemma. Then we will employ a series of arguments to
strengthen the result for this weakest performance criterion to the first and strongest
performance criterion.
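
To make the ordering of the three criteria concrete, here is a small sketch with made-up numbers. The error probabilities and the density over codes below are hypothetical and serve only to illustrate definitions (15.7)-(15.10); the function names are likewise illustrative.

```python
def max_error(errors):
    """Maximal error probability (15.7) over the messages of one code."""
    return max(errors)

def average_error(errors):
    """Average error probability (15.8) under a uniform choice of message."""
    return sum(errors) / len(errors)

def expected_average_error(code_errors, code_probs):
    """Expectation (15.9)-(15.10) of the average error over a random code."""
    return sum(p * average_error(e) for e, p in zip(code_errors, code_probs))

# Hypothetical per-message error probabilities p_e(m, C) for three codes,
# each with two messages, and a hypothetical density p_C over the codes.
code_errors = [[0.0, 0.2], [0.1, 0.1], [0.5, 0.3]]
code_probs = [0.5, 0.25, 0.25]

expectation = expected_average_error(code_errors, code_probs)
for errors in code_errors:
    print(max_error(errors), average_error(errors))
print("expectation over codes:", expectation)

# At least one code does no worse than the expectation.
best_average = min(average_error(e) for e in code_errors)
print("best average error:", best_average)
```

For every code the average error is at most the maximal error, and some code always has average error no larger than the expectation over codes, which is the observation that lets a bound on the weakest criterion be strengthened later.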



15.3 Statement of the Packing Lemma 



Lemma 15.3.1 (Packing Lemma). Suppose that we have an ensemble as in Definition 15.2.1.
Suppose that a code subspace projector $\Pi$ and codeword subspace projectors $\{\Pi_x\}_{x \in \mathcal{X}}$ exist,
they project onto subspaces of $\mathcal{H}$, and these projectors and the ensemble satisfy the following
conditions:

\begin{align}
\mathrm{Tr}\{\Pi \sigma_x\} &\geq 1 - \epsilon, \tag{15.11} \\
\mathrm{Tr}\{\Pi_x \sigma_x\} &\geq 1 - \epsilon, \tag{15.12} \\
\mathrm{Tr}\{\Pi_x\} &\leq d, \tag{15.13} \\
\Pi \sigma \Pi &\leq \frac{1}{D} \Pi, \tag{15.14}
\end{align}

where $0 < d < D$. Suppose that $\mathcal{M}$ is a set of size $|\mathcal{M}|$ with elements m. We generate a
set $\mathcal{C} = \{C_m\}_{m \in \mathcal{M}}$ of random variables $C_m$ where each random variable $C_m$ corresponds to
the message m, has density $p_X(x)$ so that its distribution is independent of the particular
message m, and takes a value in $\mathcal{X}$. This set constitutes a random code. Then there exists
a corresponding POVM $(\Lambda_m)_{m \in \mathcal{M}}$ that reliably distinguishes between the states $(\sigma_{C_m})_{m \in \mathcal{M}}$ in
the sense that the expectation of the average probability of detecting the correct state is high:

$$\mathbb{E}_{\mathcal{C}}\left\{ \frac{1}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} \mathrm{Tr}\{\Lambda_m \sigma_{C_m}\} \right\} \geq 1 - 2(\epsilon + 2\sqrt{\epsilon}) - 4\left(\frac{D}{d|\mathcal{M}|}\right)^{-1}, \tag{15.15}$$

given that D/d is large, $|\mathcal{M}| \ll D/d$, and $\epsilon$ is arbitrarily small.



Condition (15.11) states that the code subspace with projector $\Pi$ contains each message
$\sigma_x$ with high probability. Condition (15.12) states that each codeword subspace projector
$\Pi_x$ contains its corresponding state $\sigma_x$ with high probability. Condition (15.13) states that
the dimension of each codeword subspace projector $\Pi_x$ is at most some positive number
d. Condition (15.14) states that the distribution of the ensemble with expected density
operator $\sigma$ is approximately uniform when projecting it onto the subspace with projector $\Pi$.
Conditions (15.11) and (15.14) imply that

$$\mathrm{Tr}\{\Pi\} \geq D(1 - \epsilon), \tag{15.16}$$

so that the dimension of the code subspace projector $\Pi$ is approximately D. We show how to
construct a code with messages that Alice wants to send. These four conditions are crucial for
constructing a decoding POVM with the desirable property that it can distinguish between
the messages with high probability.
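
The step from (15.11) and (15.14) to (15.16) is short. The following sketch spells it out, assuming only the definition of the expected density operator in (15.2), the fact that the code subspace projector squares to itself, and the fact that the trace preserves operator inequalities:

```latex
\begin{align}
\frac{1}{D}\,\mathrm{Tr}\{\Pi\}
  &\geq \mathrm{Tr}\{\Pi \sigma \Pi\}
     && \text{trace of both sides of (15.14)} \\
  &= \mathrm{Tr}\{\Pi \sigma\}
     && \text{cyclicity of trace and } \Pi^2 = \Pi \\
  &= \sum_{x \in \mathcal{X}} p_X(x)\, \mathrm{Tr}\{\Pi \sigma_x\}
     && \text{definition (15.2)} \\
  &\geq 1 - \epsilon
     && \text{condition (15.11) for every } x
\end{align}
```

Multiplying both sides by D gives (15.16).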

The main idea of the Packing Lemma is that we can pack $|\mathcal{M}|$ classical messages into
a subspace with corresponding projector $\Pi$. There is then a small probability of error when
trying to detect the classical messages with codeword subspace projectors $\Pi_x$. The intuition
is the same as that depicted in Figure 2.6. We are trying to pack as many subspaces of size
d into a larger space of size D. In the proof of the HSW coding theorem in Chapter 19,
D will be of size $\approx 2^{nH(B)}$ and d will be of size $\approx 2^{nH(B|X)}$, suggesting that we pack in
$\approx 2^{n[H(B) - H(B|X)]} = 2^{nI(X;B)}$ messages while still being able to distinguish them reliably.
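
The trade-off between the number of messages and the ratio D/d can be seen by evaluating the right-hand side of (15.15) directly. The sketch below does this for hypothetical values of $\epsilon$, d, and D picked purely for illustration; the function name is also an assumption.

```python
import math

def packing_success_lower_bound(eps, d, D, num_messages):
    """Right-hand side of (15.15): 1 - 2(eps + 2 sqrt(eps)) - 4 d |M| / D."""
    return 1 - 2 * (eps + 2 * math.sqrt(eps)) - 4 * d * num_messages / D

# Hypothetical parameters: D/d = 2^30, so the bound is informative only
# when the number of messages is well below 2^30.
eps, d, D = 1e-6, 2 ** 10, 2 ** 40
for log_messages in (10, 20, 28, 30):
    bound = packing_success_lower_bound(eps, d, D, 2 ** log_messages)
    print(f"|M| = 2^{log_messages}: lower bound on success = {bound:.6f}")
```

With these numbers the bound is close to one for $|\mathcal{M}| = 2^{10}$ and becomes vacuous (negative) once $|\mathcal{M}|$ approaches D/d, which is the sense in which the lemma requires $|\mathcal{M}| \ll D/d$.
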

15.4 Proof of the Packing Lemma

The proof technique employs a Shannon-like argument where we generate a code at random. 
We show how to construct a POVM, the "pretty-good" measurement, that can decode a 
classical message with high probability. We then prove that the expectation of the average 
error probability is small (where the expectation is over all random codes). In a corollary in 
the next section, we finally use standard Shannon-like arguments to show that a code exists 
whose maximal probability of error for all messages is small. 

15.4.1 Code Construction 

We present a Shannon-like random coding argument to simplify the mathematics that follow.
We construct a code $\mathcal{C}$ at random by independently generating $|\mathcal{M}|$ codewords according to
the distribution $p_X(x)$. Let $\mathcal{C} = \{c_m\}_{m \in \mathcal{M}}$ be a collection of the realizations $c_m$ of $|\mathcal{M}|$
independent random variables $C_m$. Each $C_m$ takes a value $c_m$ in $\mathcal{X}$ with probability $p_X(c_m)$
and represents a classical codeword in the random code $\mathcal{C}$. The probability $p(\mathcal{C})$ of choosing
a particular code $\mathcal{C}$ is equal to the following:

$$p(\mathcal{C}) = \prod_{m=1}^{|\mathcal{M}|} p_X(c_m). \tag{15.17}$$

There is a great advantage to choosing the code in this way. The expectation of any product
$f(C_m) g(C_{m'})$ of two functions f and g of two different random codewords $C_m$ and $C_{m'}$, where
the expectation is with respect to the random choice of code, factors as follows:

\begin{align}
\mathbb{E}_{\mathcal{C}}\{f(C_m) g(C_{m'})\} &= \sum_{\mathcal{C}} p(\mathcal{C}) f(c_m) g(c_{m'}) \tag{15.18} \\
&= \sum_{c_1} p_X(c_1) \cdots \sum_{c_{|\mathcal{M}|}} p_X(c_{|\mathcal{M}|}) f(c_m) g(c_{m'}) \tag{15.19} \\
&= \sum_{c_m} p_X(c_m) f(c_m) \sum_{c_{m'}} p_X(c_{m'}) g(c_{m'}) \tag{15.20} \\
&= \mathbb{E}_X\{f(X)\} \mathbb{E}_X\{g(X)\}. \tag{15.21}
\end{align}



This factoring happens because of the random way in which we choose the code, and we 
exploit this fact in the proof of the Packing Lemma. We employ the following events in 
sequence: 

1. We choose a random code as described above.

2. We reveal the code to the sender and receiver.

3. The sender chooses a message m at random (with uniform probability according to
some random variable M) from $\mathcal{M}$ and encodes it in the codeword $c_m$. The quantum
state that the sender transmits is then equal to $\sigma_{c_m}$.

4. The receiver performs the POVM $(\Lambda_m)_{m \in \mathcal{M}}$ to determine the message that the sender
transmits, and each POVM element $\Lambda_m$ corresponds to a message m in the code. The
receiver obtains a classical result from the measurement, and we model it with the
random variable M'. The conditional probability $\Pr\{M' = m \mid M = m\}$ of obtaining
the correct result from the measurement is equal to

$$\Pr\{M' = m \mid M = m\} = \mathrm{Tr}\{\Lambda_m \sigma_{c_m}\}. \tag{15.22}$$

5. The receiver decodes correctly if $M' = M$ and decodes incorrectly if $M' \neq M$.
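
The factoring property in (15.18)-(15.21) can be verified by brute force on a small example. The sketch below enumerates every code over a three-letter alphabet with exact rational arithmetic; the alphabet, the distribution, and the functions f and g are hypothetical choices made only for this check.

```python
from fractions import Fraction
from itertools import product

# Hypothetical alphabet and distribution p_X.
alphabet = ["x0", "x1", "x2"]
p_x = {"x0": Fraction(1, 2), "x1": Fraction(1, 3), "x2": Fraction(1, 6)}

# Two arbitrary functions of a single codeword.
f = {"x0": Fraction(1), "x1": Fraction(4), "x2": Fraction(-2)}
g = {"x0": Fraction(3), "x1": Fraction(0), "x2": Fraction(5)}

num_messages = 3    # |M|
m, m_prime = 0, 2   # two different message indices

def code_probability(code):
    """p(C) from (15.17): product of p_X(c_m) over all codewords."""
    prob = Fraction(1)
    for codeword in code:
        prob *= p_x[codeword]
    return prob

# Left-hand side (15.18): sum over every code C = (c_1, ..., c_|M|).
lhs = sum(
    code_probability(code) * f[code[m]] * g[code[m_prime]]
    for code in product(alphabet, repeat=num_messages)
)

# Right-hand side (15.21): product of two single-letter expectations.
e_f = sum(p_x[x] * f[x] for x in alphabet)
e_g = sum(p_x[x] * g[x] for x in alphabet)

print(lhs, e_f * e_g, lhs == e_f * e_g)
```

The equality holds exactly because the codewords for different messages are drawn independently. It would fail in general if m and m' were the same index.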

15.4.2 POVM Construction 

We cannot directly use the projectors $\Pi_x$ in a POVM because they do not satisfy the
conditions for being a POVM. Namely, it is not necessarily true that $\sum_{x \in \mathcal{X}} \Pi_x = I$. Also,
the codeword subspace projectors $\Pi_x$ may have support outside that of the code subspace
projector $\Pi$.

To remedy these problems, first consider the following set of operators:

$$\forall x \quad \Upsilon_x \equiv \Pi \Pi_x \Pi. \tag{15.23}$$

The operator $\Upsilon_x$ is a positive operator, and the effect of "coating" the codeword subspace
projector $\Pi_x$ with the code subspace projector $\Pi$ is to slice out any part of the support of $\Pi_x$
that is not in the support of $\Pi$. From the conditions (15.11)-(15.12) of the Packing Lemma,
there should be little probability for our states of interest to lie in the part of the support of
$\Pi_x$ outside the support of $\Pi$. The operators $\Upsilon_x$ have the desirable property that they only
have support inside of the subspace corresponding to the code subspace projector $\Pi$. So we
have remedied the second problem stated above.

We now remedy the first problem stated above by constructing a POVM $\{\Lambda_m\}_{m \in \mathcal{M}}$ with
the following elements:

$$\Lambda_m \equiv \left( \sum_{m'=1}^{|\mathcal{M}|} \Upsilon_{c_{m'}} \right)^{-1/2} \Upsilon_{c_m} \left( \sum_{m'=1}^{|\mathcal{M}|} \Upsilon_{c_{m'}} \right)^{-1/2}. \tag{15.24}$$

The above POVM is the "pretty-good" or "square-root" measurement. The POVM elements
also have the property that $\sum_{m=1}^{|\mathcal{M}|} \Lambda_m \leq I$. Note that the inverse square root $A^{-1/2}$ of an
operator A is defined as the inverse square root operation only on the support of A. That
is, given a spectral decomposition of the operator A so that

$$A = \sum_a a |a\rangle\langle a|, \tag{15.25}$$

then

$$A^{-1/2} = \sum_a f(a) |a\rangle\langle a|, \tag{15.26}$$

where

$$f(a) = \begin{cases} a^{-1/2} & : a \neq 0 \\ 0 & : a = 0 \end{cases}. \tag{15.27}$$

We can have a complete POVM by adding the operator $\Lambda_0 = I - \sum_m \Lambda_m$ to the set. The
idea of the pretty-good measurement is that the POVM elements $\{\Lambda_m\}_{m=1}^{|\mathcal{M}|}$ correspond to
the messages sent and the element $\Lambda_0$ corresponds to an error result.
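
As an illustration of (15.24)-(15.27), the following sketch builds the square-root measurement for two non-orthogonal qubit codewords, $|0\rangle$ and $|+\rangle$. It is a toy instance under several assumptions that are not part of the lemma: the states are pure and real, each codeword projector is the state itself, and the code projector is the identity, so that $\Upsilon_x = \Pi_x$. The small 2x2 eigen-solver `eigh2` is a hypothetical helper written only for this example.

```python
import math

def eigh2(a):
    """Eigenpairs (value, unit vector) of a real symmetric 2x2 matrix."""
    (p, q), (_, r) = a
    mean, rad = (p + r) / 2, math.hypot((p - r) / 2, q)
    if rad < 1e-15:  # multiple of the identity: any orthonormal basis works
        return [(mean, (1.0, 0.0)), (mean, (0.0, 1.0))]
    pairs = []
    for lam in (mean + rad, mean - rad):
        # Both candidates solve (A - lam I) v = 0; keep the longer one.
        v = max([(q, lam - p), (lam - r, q)], key=lambda w: math.hypot(*w))
        norm = math.hypot(*v)
        pairs.append((lam, (v[0] / norm, v[1] / norm)))
    return pairs

def inv_sqrt(a, tol=1e-12):
    """A^{-1/2} restricted to the support of A, as in (15.25)-(15.27)."""
    out = [[0.0, 0.0], [0.0, 0.0]]
    for lam, v in eigh2(a):
        weight = lam ** -0.5 if lam > tol else 0.0  # f(a) from (15.27)
        for i in range(2):
            for j in range(2):
                out[i][j] += weight * v[i] * v[j]
    return out

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace(a):
    return a[0][0] + a[1][1]

def projector(v):
    return [[x * y for y in v] for x in v]

s = 1 / math.sqrt(2)
codewords = [(1.0, 0.0), (s, s)]                # |0> and |+>
upsilons = [projector(v) for v in codewords]    # Upsilon_x = Pi_x here
total = [[sum(u[i][j] for u in upsilons) for j in range(2)] for i in range(2)]
root = inv_sqrt(total)
povm = [matmul(matmul(root, u), root) for u in upsilons]        # (15.24)
success = [trace(matmul(el, u)) for el, u in zip(povm, upsilons)]
povm_sum = [[sum(el[i][j] for el in povm) for j in range(2)] for i in range(2)]
print("success probabilities:", success)
print("sum of POVM elements:", povm_sum)
```

In this instance the two operators span the whole qubit space, so the POVM elements sum to the identity and the error element $\Lambda_0$ vanishes. Each message is identified with the same probability, which is below one because the codewords overlap.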

The above square-root measurement is useful because we can apply the following result
that holds for any positive operators S and T such that $0 \leq S \leq I$ and $T \geq 0$:

$$I - (S + T)^{-1/2} S (S + T)^{-1/2} \leq 2(I - S) + 4T. \tag{15.28}$$

Exercise 15.4.1 (Hayashi-Nagaoka Operator Inequality) Prove the following inequality

$$I - (S + T)^{-1/2} S (S + T)^{-1/2} \leq 2(I - S) + 4T \tag{15.29}$$

that holds for any positive operators S and T such that $0 \leq S \leq I$ and $T \geq 0$. (Hint:
Suppose that the projection onto the range of S + T is the whole of Hilbert space. Use the
fact that

$$(M - 2I) T (M - 2I) \geq 0 \tag{15.30}$$

for a positive operator T and any operator M. Use the fact that $\sqrt{\cdot}$ is operator monotone:
$A \geq B \Rightarrow \sqrt{A} \geq \sqrt{B}$.)

We make the following substitutions:

$$T = \sum_{m' \neq m}^{|\mathcal{M}|} \Upsilon_{c_{m'}}, \qquad S = \Upsilon_{c_m}, \tag{15.31}$$

so that the bound in (15.28) becomes

$$I - \Lambda_m \leq 2(I - \Upsilon_{c_m}) + 4 \sum_{m' \neq m}^{|\mathcal{M}|} \Upsilon_{c_{m'}}. \tag{15.32}$$

The above expression is useful in our error analysis in the next section.
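
A quick sanity check of (15.28) is possible in the simplest special case, where S and T are ordinary numbers (one-dimensional operators). This is only a consistency check under that assumption and not a proof of the operator inequality; the grid of test values is arbitrary.

```python
def hn_gap(s, t):
    """Right side minus left side of (15.28) for numbers 0 <= s <= 1, t >= 0."""
    total = s + t
    # The inverse square root acts only on the support, so it is 0 when s + t = 0.
    inv_sqrt = total ** -0.5 if total > 0 else 0.0
    lhs = 1 - inv_sqrt * s * inv_sqrt
    rhs = 2 * (1 - s) + 4 * t
    return rhs - lhs

s_values = [k / 50 for k in range(51)]   # s in [0, 1]
t_values = [k / 10 for k in range(31)]   # t in [0, 3]
worst_gap = min(hn_gap(s, t) for s in s_values for t in t_values)
print("smallest gap on the grid:", worst_gap)
```

On this grid the gap is never negative and it closes only at s = 1, t = 0, which corresponds to a perfectly identified message with no competing codewords.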

15.4.3 Error Analysis 

Suppose we have chosen a particular code $\mathcal{C}$. Let $p_e(m, \mathcal{C})$ be the probability of decoding
incorrectly given that message m was sent while using the code $\mathcal{C}$:

$$p_e(m, \mathcal{C}) \equiv \mathrm{Tr}\{(I - \Lambda_m) \sigma_{c_m}\}. \tag{15.33}$$

Then using (15.28), the following bound applies to the error probability for message m:

\begin{align}
p_e(m, \mathcal{C}) &\leq 2\,\mathrm{Tr}\{(I - \Upsilon_{c_m}) \sigma_{c_m}\} + 4\,\mathrm{Tr}\left\{ \sum_{m' \neq m}^{|\mathcal{M}|} \Upsilon_{c_{m'}} \sigma_{c_m} \right\} \tag{15.34} \\
&= 2\,\mathrm{Tr}\{(I - \Upsilon_{c_m}) \sigma_{c_m}\} + 4 \sum_{m' \neq m}^{|\mathcal{M}|} \mathrm{Tr}\{\Upsilon_{c_{m'}} \sigma_{c_m}\}. \tag{15.35}
\end{align}

The above bound on the message error probability for code $\mathcal{C}$ has a similar interpretation as
that in classical Shannon-like proofs. We bound the error probability by the probability of
decoding to any message outside the message space operator $\Upsilon_{c_m}$ (the first term in (15.35))
summed with the probability of confusing the transmitted message with a message $c_{m'}$
different from the correct one (the second term in (15.35)). The average error probability $\bar{p}_e(\mathcal{C})$
over all transmitted messages for code $\mathcal{C}$ is

$$\bar{p}_e(\mathcal{C}) = \frac{1}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} p_e(m, \mathcal{C}) \tag{15.36}$$

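because Alice chooses the message m that she would like to transmit according to the uniform
distribution. The average error probability $\bar{p}_e(\mathcal{C})$ then obeys the following bound:

$$\bar{p}_e(\mathcal{C}) \leq \frac{1}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} \left[ 2\,\mathrm{Tr}\{(I - \Upsilon_{c_m}) \sigma_{c_m}\} + 4 \sum_{m' \neq m}^{|\mathcal{M}|} \mathrm{Tr}\{\Upsilon_{c_{m'}} \sigma_{c_m}\} \right]. \tag{15.37}$$
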
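Consider the first term $\mathrm{Tr}\{(I - \Upsilon_{c_m}) \sigma_{c_m}\}$ on the RHS above. We can bound it from
above by a small number, simply by applying (15.11)-(15.12) and the Gentle Operator Lemma
(Lemma 9.4.2). Consider the following chain of inequalities:

\begin{align}
\mathrm{Tr}\{\Upsilon_{c_m} \sigma_{c_m}\} &= \mathrm{Tr}\{\Pi \Pi_{c_m} \Pi \sigma_{c_m}\} \tag{15.38} \\
&= \mathrm{Tr}\{\Pi_{c_m} \Pi \sigma_{c_m} \Pi\} \tag{15.39} \\
&\geq \mathrm{Tr}\{\Pi_{c_m} \sigma_{c_m}\} - \|\Pi \sigma_{c_m} \Pi - \sigma_{c_m}\|_1 \tag{15.40} \\
&\geq 1 - \epsilon - 2\sqrt{\epsilon}. \tag{15.41}
\end{align}

The first equality follows by the definition of $\Upsilon_{c_m}$ in (15.23). The second equality follows from
cyclicity of the trace. The first inequality follows from applying (9.58). The last inequality
follows from applying (15.12) to $\mathrm{Tr}\{\Pi_{c_m} \sigma_{c_m}\}$ and applying (15.11) and the Gentle Operator
Lemma (Lemma 9.4.2) to $\|\Pi \sigma_{c_m} \Pi - \sigma_{c_m}\|_1$. The above bound then implies the following
one:

\begin{align}
\mathrm{Tr}\{(I - \Upsilon_{c_m}) \sigma_{c_m}\} &= 1 - \mathrm{Tr}\{\Upsilon_{c_m} \sigma_{c_m}\} \tag{15.42} \\
&\leq \epsilon + 2\sqrt{\epsilon}, \tag{15.43}
\end{align}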
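
and by substituting into (15.37), we get the following bound on the average probability of
error:

\begin{align}
\bar{p}_e(\mathcal{C}) &\leq \frac{1}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} \left[ 2(\epsilon + 2\sqrt{\epsilon}) + 4 \sum_{m' \neq m}^{|\mathcal{M}|} \mathrm{Tr}\{\Upsilon_{c_{m'}} \sigma_{c_m}\} \right] \tag{15.44} \\
&= 2(\epsilon + 2\sqrt{\epsilon}) + \frac{4}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} \sum_{m' \neq m}^{|\mathcal{M}|} \mathrm{Tr}\{\Upsilon_{c_{m'}} \sigma_{c_m}\}. \tag{15.45}
\end{align}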



At this point, bounding the average error probability further is a bit difficult, given
the sheer number of combinations of terms $\mathrm{Tr}\{\Upsilon_{c_{m'}} \sigma_{c_m}\}$ that we would have to consider
to do so. Thus, we should now invoke the classic Shannon argument in order to simplify
the mathematics. Instead of considering the average probability of error, we consider the
expectation of the average error probability $\mathbb{E}_{\mathcal{C}}\{\bar{p}_e(\mathcal{C})\}$ with respect to all possible random
codes $\mathcal{C}$. Considering this error quantity significantly simplifies the mathematics because of
the way in which we constructed the code. We can use the probability distribution $p_X(x)$ to
compute the expectation $\mathbb{E}_{\mathcal{C}}$ because we constructed our code according to this distribution.
The bound above becomes as follows:

\begin{align}
\mathbb{E}_{\mathcal{C}}\{\bar{p}_e(\mathcal{C})\} &\leq \mathbb{E}_{\mathcal{C}}\left\{ 2(\epsilon + 2\sqrt{\epsilon}) + \frac{4}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} \sum_{m' \neq m}^{|\mathcal{M}|} \mathrm{Tr}\{\Upsilon_{C_{m'}} \sigma_{C_m}\} \right\} \tag{15.46} \\
&= 2(\epsilon + 2\sqrt{\epsilon}) + \frac{4}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} \sum_{m' \neq m}^{|\mathcal{M}|} \mathbb{E}_{\mathcal{C}}\{\mathrm{Tr}\{\Upsilon_{C_{m'}} \sigma_{C_m}\}\}, \tag{15.47}
\end{align}

by exploiting the linearity of expectation.

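We now calculate the expectation of the expression $\mathrm{Tr}\{\Upsilon_{C_{m'}} \sigma_{C_m}\}$ over all random codes
$\mathcal{C}$:

\begin{align}
\mathbb{E}_{\mathcal{C}}\{\mathrm{Tr}\{\Upsilon_{C_{m'}} \sigma_{C_m}\}\} &= \mathbb{E}_{\mathcal{C}}\{\mathrm{Tr}\{\Pi \Pi_{C_{m'}} \Pi \sigma_{C_m}\}\} \tag{15.48} \\
&= \mathbb{E}_{\mathcal{C}}\{\mathrm{Tr}\{\Pi_{C_{m'}} \Pi \sigma_{C_m} \Pi\}\}. \tag{15.49}
\end{align}

The first equality follows from the definition in (15.23), and the second equality follows
from cyclicity of trace. Independence of random variables $C_m$ and $C_{m'}$ (from the code
construction) gives that the above expression equals

\begin{align}
\mathrm{Tr}\{\mathbb{E}_{\mathcal{C}}\{\Pi_{C_{m'}}\} \Pi\, \mathbb{E}_{\mathcal{C}}\{\sigma_{C_m}\} \Pi\} &= \mathrm{Tr}\{\mathbb{E}_{\mathcal{C}}\{\Pi_{C_{m'}}\} \Pi \sigma \Pi\} \tag{15.50} \\
&\leq \mathrm{Tr}\left\{\mathbb{E}_{\mathcal{C}}\{\Pi_{C_{m'}}\} \frac{1}{D} \Pi\right\} \tag{15.51} \\
&= \frac{1}{D} \mathrm{Tr}\{\mathbb{E}_{\mathcal{C}}\{\Pi_{C_{m'}}\} \Pi\}, \tag{15.52}
\end{align}

where the first equality uses the fact that $\mathbb{E}_{\mathcal{C}}\{\sigma_{C_m}\} = \sum_{x \in \mathcal{X}} p_X(x) \sigma_x = \sigma$ and $\Pi$ is a constant
with respect to the expectation. The first inequality uses the fourth condition (15.14) of the
Packing Lemma, the fact that $\Pi \sigma \Pi$, $\Pi$, and $\Pi_{C_{m'}}$ are all positive operators, and
$\mathrm{Tr}\{CA\} \geq \mathrm{Tr}\{CB\}$ for $C \geq 0$ and $A \geq B$. Continuing, we have

\begin{align}
\frac{1}{D} \mathrm{Tr}\{\mathbb{E}_{\mathcal{C}}\{\Pi_{C_{m'}}\} \Pi\} &\leq \frac{1}{D} \mathrm{Tr}\{\mathbb{E}_{\mathcal{C}}\{\Pi_{C_{m'}}\}\} \tag{15.53} \\
&= \frac{1}{D} \mathbb{E}_{\mathcal{C}}\{\mathrm{Tr}\{\Pi_{C_{m'}}\}\} \tag{15.54} \\
&\leq \frac{d}{D}. \tag{15.55}
\end{align}

The first inequality follows from the fact that $\Pi \leq I$ and $\Pi_{C_{m'}}$ is a positive operator. The
last inequality follows from (15.13). The following inequality then holds by considering the
development from (15.48) to (15.55):

$$\mathbb{E}_{\mathcal{C}}\{\mathrm{Tr}\{\sigma_{C_m} \Upsilon_{C_{m'}}\}\} \leq \frac{d}{D}. \tag{15.56}$$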



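We substitute into (15.47) to show that the expectation $\mathbb{E}_{\mathcal{C}}\{\bar{p}_e(\mathcal{C})\}$ of the average error
probability $\bar{p}_e(\mathcal{C})$ over all codes obeys

\begin{align}
\mathbb{E}_{\mathcal{C}}\{\bar{p}_e(\mathcal{C})\} &\leq 2(\epsilon + 2\sqrt{\epsilon}) + \frac{4}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} \sum_{m' \neq m}^{|\mathcal{M}|} \mathbb{E}_{\mathcal{C}}\{\mathrm{Tr}\{\sigma_{C_m} \Upsilon_{C_{m'}}\}\} \tag{15.57} \\
&\leq 2(\epsilon + 2\sqrt{\epsilon}) + \frac{4}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} \sum_{m' \neq m}^{|\mathcal{M}|} \frac{d}{D} \tag{15.58} \\
&= 2(\epsilon + 2\sqrt{\epsilon}) + 4(|\mathcal{M}| - 1)\frac{d}{D} \tag{15.59} \\
&\leq 2(\epsilon + 2\sqrt{\epsilon}) + 4\left(\frac{D}{d|\mathcal{M}|}\right)^{-1}. \tag{15.60}
\end{align}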

15.5 Derandomization and Expurgation 

The above version of the Packing Lemma is a randomized version that shows how the expec- 
tation of the average probability of error is small. We now prove a derandomized version that 
guarantees the existence of a code with small maximal error probability for each message. 
The last two arguments are traditionally called derandomization and expurgation. 



Corollary 15.5.1. Suppose we have the ensemble as in Definition 15.2.1. Suppose that a
code subspace projector $\Pi$ and codeword subspace projectors $\{\Pi_x\}_{x \in \mathcal{X}}$ exist, they project onto
subspaces of $\mathcal{H}$, and these projectors and the ensemble have the following properties:

\begin{align}
\mathrm{Tr}\{\Pi \sigma_x\} &\geq 1 - \epsilon, \tag{15.61} \\
\mathrm{Tr}\{\Pi_x \sigma_x\} &\geq 1 - \epsilon, \tag{15.62} \\
\mathrm{Tr}\{\Pi_x\} &\leq d, \tag{15.63} \\
\Pi \sigma \Pi &\leq \frac{1}{D} \Pi, \tag{15.64}
\end{align}

where $0 < d < D$. Suppose that $\mathcal{M}$ is a set of size $|\mathcal{M}|$ with elements m. Then there exists
a code $\mathcal{C}_0 = \{c_m\}_{m \in \mathcal{M}}$ with codewords $c_m$ depending on the message m and taking values in
$\mathcal{X}$, and there exists a corresponding POVM $(\Lambda_m)_{m \in \mathcal{M}}$ that reliably distinguishes between the
states $(\sigma_{c_m})_{m \in \mathcal{M}}$ in the sense that the probability of detecting the correct state is high:

$$\forall m \in \mathcal{M} \quad \mathrm{Tr}\{\Lambda_m \sigma_{c_m}\} \geq 1 - 4(\epsilon + 2\sqrt{\epsilon}) - 8\left(\frac{D}{d|\mathcal{M}|}\right)^{-1}, \tag{15.65}$$

because we can make $\epsilon$ and $(D/(d|\mathcal{M}|))^{-1}$ arbitrarily small (this holds if $|\mathcal{M}| \ll D/d$). We can
use the code $\mathcal{C}_0$ and the POVM $(\Lambda_m)_{m \in \mathcal{M}}$ respectively to encode and decode $|\mathcal{M}|$ classical
messages with high success probability.

Proof. Generate a random code according to the construction in the previous lemma. The
expectation of the average error probability then satisfies the bound in the Packing Lemma.
We now make a few standard Shannon-like arguments to strengthen the result of the previous
lemma.

Derandomization. The expectation of the average error probability $\mathbb{E}_{\mathcal{C}}\{\bar{p}_e(\mathcal{C})\}$
satisfies the following bound:

$$\mathbb{E}_{\mathcal{C}}\{\bar{p}_e(\mathcal{C})\} \leq \epsilon', \tag{15.66}$$

where $\epsilon' \equiv 2(\epsilon + 2\sqrt{\epsilon}) + 4(D/(d|\mathcal{M}|))^{-1}$ is the error bound from the Packing Lemma.
It then follows that the average error probability of at least one code $\mathcal{C}_0 = \{c_m\}_{m \in \mathcal{M}}$ satisfies
the above bound:

$$\bar{p}_e(\mathcal{C}_0) \leq \epsilon'. \tag{15.67}$$

Choose this code $\mathcal{C}_0$ as the code, and it is possible to find this code $\mathcal{C}_0$ in practice by exhaustive
search. This process is known as derandomization.

Exercise 15.5.1 Use Markov's inequality to prove an even stronger derandomization of the
Packing Lemma. Prove that the overwhelming fraction $1 - \sqrt{\epsilon'}$ of codes constructed randomly
have average error probability less than $\sqrt{\epsilon'}$.



Expurgation. We now consider the maximal error probability instead of the average
error probability by an expurgation argument. We know that $p_e(m) \leq 2\epsilon'$ for at least half
of the indices (if it were not true, then these indices would contribute more than $\epsilon'$ to the
average error probability $\bar{p}_e$). Throw out the half of the codewords with the worst decoding
probability and redefine the code according to the new set of indices. These steps have a
negligible effect on the parameters of the code when we later consider a large number of uses
of a noisy quantum channel. It is helpful to refer back to Exercise 2.2.1 at this point. □
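
The expurgation step is simple enough to sketch in a few lines. The per-message error probabilities below are hypothetical numbers for a fixed code; the function name `expurgate` is likewise an illustrative choice.

```python
def expurgate(errors):
    """Return the indices of the better half of the codewords."""
    ranked = sorted(range(len(errors)), key=lambda m: errors[m])
    return sorted(ranked[: len(errors) // 2])

# Hypothetical error probabilities p_e(m) of the eight codewords of one code.
errors = [0.01, 0.30, 0.02, 0.00, 0.05, 0.03, 0.20, 0.01]
eps_prime = sum(errors) / len(errors)   # average error of this code

kept = expurgate(errors)
worst_kept = max(errors[m] for m in kept)
print("kept indices:", kept)
print("worst remaining error:", worst_kept, "bound 2*eps':", 2 * eps_prime)
```

Fewer than half of the codewords can have error above twice the average, so every codeword in the retained half meets the bound $2\epsilon'$. Halving the number of messages costs only one bit, which is negligible for a long code.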



Exercise 15.5.2 Use Markov's inequality to prove an even stronger expurgation argument
(following on the result of Exercise 15.5.1). Prove that we can retain a large fraction $1 - \sqrt[4]{\epsilon'}$
of the codewords (expurgating $\sqrt[4]{\epsilon'}$ of them) so that each remaining codeword has error
probability less than $\sqrt[4]{\epsilon'}$.

Exercise 15.5.3 Prove that the Packing Lemma and its corollary hold for the same ensemble
and a set of projectors for which the following conditions hold:

\begin{align}
\sum_{x \in \mathcal{X}} p_X(x) \mathrm{Tr}\{\sigma_x \Pi\} &\geq 1 - \epsilon, \tag{15.68} \\
\sum_{x \in \mathcal{X}} p_X(x) \mathrm{Tr}\{\sigma_x \Pi_x\} &\geq 1 - \epsilon, \tag{15.69} \\
\mathrm{Tr}\{\Pi_x\} &\leq d, \tag{15.70} \\
\Pi \sigma \Pi &\leq \frac{1}{D} \Pi. \tag{15.71}
\end{align}

Exercise 15.5.4 Prove that a variation of the Packing Lemma holds in which the POVM
is of the following form:

$$\Lambda_m \equiv \left( \sum_{m'=1}^{|\mathcal{M}|} \Pi_{c_{m'}} \right)^{-1/2} \Pi_{c_m} \left( \sum_{m'=1}^{|\mathcal{M}|} \Pi_{c_{m'}} \right)^{-1/2}. \tag{15.72}$$

That is, it is not actually necessary to "coat" each operator in the square-root measurement
with the overall message subspace projector.

15.6 History and Further Reading 

Holevo [144], Schumacher, and Westmoreland [219] did not prove the classical coding theorem
with the Packing Lemma, but they instead used other arguments to bound the probability
of error. The operator inequality in (15.28) is at the heart of the Packing Lemma. Hayashi
and Nagaoka proved this operator inequality in the more general setting of the quantum
information spectrum method [130], where there is no IID constraint and essentially no
structure to a channel. Devetak et al. later exploited this operator inequality in the context
of entanglement-assisted classical coding [156] and followed the approach in Ref. [130] to
prove the Packing Lemma.


The Covering Lemma



The goal of the Covering Lemma is perhaps opposite to that of the Packing Lemma because 
it applies in a setting where one party wishes to make messages indistinguishable to another 
party (instead of trying to make them distinguishable as in the Packing Lemma of the 
previous chapter). That is, the Covering Lemma is helpful when one party is trying to 
simulate a noisy channel to another party, rather than trying to simulate a noiseless channel. 
One party can accomplish this task by randomly covering the Hilbert space of the other 
party (this viewpoint gives the Covering Lemma its name). 

One can certainly simulate noise by choosing a quantum state uniformly at random from 
a large set of quantum states and passing along the chosen quantum state to a third party 
without telling which state was chosen. But the problem with this approach is that it could 
potentially be expensive if the set from which we choose a random state is large, and we 
would really like to use as few resources as possible in order to simulate noise. That is, we 
would like the set from which we choose a quantum state uniformly at random to be as small 
as possible when simulating noise. The Covering Lemma is similar to the Packing Lemma 
in the sense that its conditions for application are general (involving bounds on projectors 
and an ensemble), but it gives an asymptotically efficient scheme for simulating noise when 
we apply it in an IID setting. 

One application of the Covering Lemma in quantum Shannon theory is in the construction 
of a code for transmitting private classical information over a quantum channel (discussed
in Chapter 22). The method of proof for private classical transmission involves a clever
combination of packing messages so that Bob can distinguish them, while covering Eve's 
space in such a way that Eve cannot distinguish the messages intended for Bob. A few other 
applications of the Covering Lemma are in secret key distillation, determining the amount 
of noise needed to destroy correlations in a bipartite state, and compressing the outcomes of 
an IID measurement on an IID quantum state. 

We begin this chapter with a simple example to explain the main idea behind the Covering
Lemma. Section 16.2 then discusses its general setting and gives its statement. We dissect its
proof into several different parts: the construction of a "Chernoff ensemble," the construction
of a "Chernoff code," the application of the Chernoff bound, and the error analysis. The
main tool that we use to prove the Covering Lemma is the Operator Chernoff bound. This
bound is a generalization of the standard Chernoff bound from probability theory, which
states that the sample mean of a sequence of IID random variables converges exponentially
to its true mean. The proof of the operator version of the Chernoff bound is straightforward
and we provide it in Appendix A. The exponential convergence rate in the Chernoff bound
is much stronger than the polynomial convergence rate from Chebyshev's inequality and is
helpful in proving the existence of good private classical codes in Chapter 22.

16.1 Introductory Example 

Suppose that Alice is trying to communicate with Bob as before, but now there is an eaves- 
dropper Eve listening in on their communication. Alice wants the messages that she is 
sending to Bob to be private so that Eve does not gain any information about the message 
that she is sending. 

How can Alice make the information that she is sending private? The strongest criterion 
for security is to ensure that whatever Eve receives is independent of what Alice is sending. 
Alice may have to sacrifice the amount of information she can communicate to Bob in order 
to have privacy, but this sacrifice is worth it to her because she really does not want Eve to 
know anything about the intended message for Bob. 

We first give an example to motivate a general method that Alice can use to make her 
information private. Suppose Alice can transmit one of four messages {a, b, c, d} to Bob, 
and suppose he receives them perfectly as distinguishable quantum states. She chooses from 
these messages with equal probability. Suppose further that Alice and Eve know that Eve 
receives one of the following four states corresponding to each of Alice's messages: 

$$a \to |0\rangle, \quad b \to |1\rangle, \quad c \to |+\rangle, \quad d \to |-\rangle. \tag{16.1}$$

Observe that each of Eve's states lies in the two-dimensional Hilbert space of a qubit. We 
refer to the quantum states in the above ensemble as "Eve's ensemble." 

We are not so much concerned with what Bob receives for the purposes of this example,
but we just make the assumption that he can distinguish the four messages that Alice sends.
Without loss of generality, let us just assume that he receives the messages unaltered in
some preferred orthonormal basis such as $\{|a\rangle, |b\rangle, |c\rangle, |d\rangle\}$ so that he can distinguish the
four messages, and let us call this ensemble "Bob's ensemble."

Both Alice and Eve then know that the expected density operator of Eve's ensemble is
the maximally mixed state if Eve does not know which message Alice chooses:

$$\frac{1}{4}|0\rangle\langle 0| + \frac{1}{4}|1\rangle\langle 1| + \frac{1}{4}|+\rangle\langle +| + \frac{1}{4}|-\rangle\langle -| = \frac{I}{2}. \tag{16.2}$$

How can Alice ensure that Eve's information is independent of the message Alice is sending? 
Alice can choose subsets or subensembles of the states in Eve's ensemble to simulate the 
expected density operator of Eve's ensemble. Let us call these new simulating ensembles the 
"fake ensembles." Alice chooses the member states of the fake ensembles according to the 
uniform distribution in order to randomize Eve's knowledge. The density operator for each
new fake ensemble is its "fake expected density operator." 

Which states work well for being members of the fake ensembles? An equiprobable
mixture of the states $|0\rangle$ and $|1\rangle$ suffices to simulate the expected density operator of Eve's
ensemble because the fake expected density operator of this new ensemble is as follows:

$$\frac{1}{2}|0\rangle\langle 0| + \frac{1}{2}|1\rangle\langle 1| = \frac{I}{2}. \tag{16.3}$$

An equiprobable mixture of the states $|+\rangle$ and $|-\rangle$ also works because the fake expected
density operator of this other fake ensemble is as follows:

$$\frac{1}{2}|+\rangle\langle +| + \frac{1}{2}|-\rangle\langle -| = \frac{I}{2}. \tag{16.4}$$

So it is possible for Alice to encode a private bit this way. She first generates a random bit 
that selects a particular message within each fake ensemble. So she selects a or b according 
to the random bit if she wants to transmit a "0" privately to Bob, and she selects c or d 
according to the random bit if she wants to transmit a "1" privately to Bob. In each of 
these cases, Eve's resulting expected density operator is the maximally mixed state. Thus, 
there is no measurement that Eve can perform to distinguish the original message that Alice 
transmits. Bob, on the other hand, can perform a measurement in the basis $\{|a\rangle, |b\rangle, |c\rangle, |d\rangle\}$
to determine Alice's private bit. Then Eve's best strategy is just to guess at the transmitted
message. In the case of one private bit, Eve can guess its value correctly with probability
1/2, but Alice and Bob can make this probability exponentially small if Alice sends more
private bits with this technique (the guessing probability becomes $2^{-n}$ for $n$ private bits).

We can explicitly calculate Eve's accessible information about the private bit. Consider
Eve's impression of the state if she does not know which message Alice transmits: it is an
equal mixture of the states $\{|0\rangle, |1\rangle, |+\rangle, |-\rangle\}$ (the maximally mixed state $I/2$).
Eve's impression of the state "improves" to an equal mixture of the states $\{|0\rangle, |1\rangle\}$ or
$\{|+\rangle, |-\rangle\}$, both with density operator $I/2$, if she does know which message Alice transmits.
The following classical-quantum state describes this setting:

$$\rho^{KME} \equiv \frac{1}{4}\Big[\, |0\rangle\langle 0|^{K} \otimes |0\rangle\langle 0|^{M} \otimes |0\rangle\langle 0|^{E} + |0\rangle\langle 0|^{K} \otimes |1\rangle\langle 1|^{M} \otimes |1\rangle\langle 1|^{E} + |1\rangle\langle 1|^{K} \otimes |0\rangle\langle 0|^{M} \otimes |+\rangle\langle +|^{E} + |1\rangle\langle 1|^{K} \otimes |1\rangle\langle 1|^{M} \otimes |-\rangle\langle -|^{E} \Big], \tag{16.5}$$



where we suppose that Eve never has access to the M register. If she does not have access to 
K, then her state is the maximally mixed state (obtained by tracing out K and M). If she 
does know K, then her state is still the maximally mixed state. We can now calculate Eve's 
accessible information about the private bit by evaluating the quantum mutual information 
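of the state $\rho^{KME}$:

\begin{align}
I(K;E)_{\rho} &= H(E)_{\rho} - H(E|K)_{\rho} \tag{16.6}\\
&= H(I/2) - \sum_{k=0}^{1} \frac{1}{2} H(E)_{\rho_k} \tag{16.7}\\
&= H(I/2) - \frac{1}{2} H(\{|0\rangle, |1\rangle\}) - \frac{1}{2} H(\{|+\rangle, |-\rangle\}) \tag{16.8}\\
&= H(I/2) - \frac{1}{2} H(I/2) - \frac{1}{2} H(I/2) = 0. \tag{16.9}
\end{align}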

Thus using this scheme, Eve has no accessible information about the private bit as we argued 
before. 
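The calculation above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names (`outer`, `mix`, `entropy`) are illustrative choices and not notation from the text, and the code assumes real-valued qubit states, which covers $|0\rangle$, $|1\rangle$, $|+\rangle$, and $|-\rangle$.

```python
import math

def outer(v):
    """Return the 2x2 projector |v><v| for a real 2-vector v."""
    return [[v[0] * v[0], v[0] * v[1]], [v[1] * v[0], v[1] * v[1]]]

def mix(states, probs):
    """Return the density matrix sum_i p_i |v_i><v_i|."""
    rho = [[0.0, 0.0], [0.0, 0.0]]
    for v, p in zip(states, probs):
        proj = outer(v)
        for i in range(2):
            for j in range(2):
                rho[i][j] += p * proj[i][j]
    return rho

def entropy(rho):
    """Von Neumann entropy in bits of a real symmetric 2x2 density matrix."""
    a, b, d = rho[0][0], rho[0][1], rho[1][1]
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    eigenvalues = [mean + radius, mean - radius]
    return -sum(e * math.log2(e) for e in eigenvalues if e > 1e-12)

s = 1 / math.sqrt(2)
zero, one, plus, minus = [1.0, 0.0], [0.0, 1.0], [s, s], [s, -s]

rho_E = mix([zero, one, plus, minus], [0.25] * 4)  # Eve's state without K
rho_k0 = mix([zero, one], [0.5, 0.5])              # Eve's state given K = 0
rho_k1 = mix([plus, minus], [0.5, 0.5])            # Eve's state given K = 1

# I(K;E) = H(E) - sum_k p(k) H(E | K = k), with p(k) = 1/2.
mutual_info = entropy(rho_E) - 0.5 * entropy(rho_k0) - 0.5 * entropy(rho_k1)
print("H(E)   =", entropy(rho_E))
print("I(K;E) =", mutual_info)
```

All three density matrices come out as $I/2$ up to rounding, so the mutual information is zero up to floating-point error.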

We are interested in making this scheme use as little noise as possible because Alice 
would like to transmit as much information as she can to Bob while still retaining privacy. 
Therefore, Alice should try to make the fake ensembles as small as possible. In the above 
example, Alice cannot make the fake ensembles any smaller because a smaller size would 
leak information to Eve. 



16.2 Setting and Statement of the Covering Lemma 

The setting of the Covering Lemma is a generalization of the setting in the above example. 
It essentially uses the same strategy for making information private, but the mathematical 
analysis becomes more involved in the more general setting. In general, we cannot have 
perfect privacy as in the above example, but instead we ask only for approximate privacy. 
Approximate privacy then becomes perfect in the asymptotic limit in the IID setting. 

We first define the relevant ensemble for the Covering Lemma. We call it the "true 
ensemble" in order to distinguish it from the "fake ensemble." 

Definition 16.2.1 (True Ensemble). Suppose $\mathcal{X}$ is a set of size $|\mathcal{X}|$ with elements $x$. Suppose
we have an ensemble $\{p_X(x), \sigma_x\}_{x \in \mathcal{X}}$ of quantum states where each value $x$ occurs with
probability $p_X(x)$ according to some random variable $X$, and suppose we encode each value $x$ into
a quantum state $\sigma_x$. The expected density operator of the ensemble is $\sigma \equiv \sum_{x \in \mathcal{X}} p_X(x)\sigma_x$.

The definition for a fake ensemble is similar to the way that we constructed the fake 
ensembles in the example. It is merely a subset of the states in the true ensemble chosen 
according to a uniform distribution. 

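Definition 16.2.2 (Fake Ensemble). Consider a set $\mathcal{S}$ where $\mathcal{S} \subset \mathcal{X}$. The fake ensemble is
as follows:

$$\left\{ \frac{1}{|\mathcal{S}|}, \sigma_s \right\}_{s \in \mathcal{S}}. \tag{16.10}$$

Let $\overline{\sigma}$ denote the "fake expected density operator" of the fake ensemble:

$$\overline{\sigma}(\mathcal{S}) \equiv \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \sigma_s. \tag{16.11}$$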

In the example, Alice was able to obtain perfect privacy from Eve. We need a good 
measure of privacy because it is not possible in general to obtain perfect privacy, but Alice 
can instead obtain only approximate privacy. We call this measure the "obfuscation error" 
because it determines how well Alice can obfuscate the state that Eve receives. 

Definition 16.2.3 (Obfuscation Error). The obfuscation error $o_e(\mathcal{S})$ of set $\mathcal{S}$ is a measure of
how close the fake expected density operator $\overline{\sigma}(\mathcal{S})$ is to the actual expected density operator:

$$o_e(\mathcal{S}) = \|\overline{\sigma}(\mathcal{S}) - \sigma\|_1. \tag{16.12}$$
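As an illustration of this definition, the sketch below evaluates the obfuscation error for subsets of the four-state ensemble from the introductory example, taking the true ensemble to be uniform over $|0\rangle, |1\rangle, |+\rangle, |-\rangle$. The names `STATES`, `expected_operator`, and `obfuscation_error` are hypothetical, and the code stores a real symmetric $2 \times 2$ matrix as a triple `(a, b, d)` so that only the standard library is needed.

```python
import math

S2 = 1 / math.sqrt(2)
# Real amplitudes of the four states in Eve's ensemble.
STATES = {"0": (1.0, 0.0), "1": (0.0, 1.0), "+": (S2, S2), "-": (S2, -S2)}

def expected_operator(labels):
    """Uniform mixture of projectors, stored as (a, b, d) for [[a, b], [b, d]]."""
    a = b = d = 0.0
    for label in labels:
        x, y = STATES[label]
        a += x * x
        b += x * y
        d += y * y
    n = len(labels)
    return (a / n, b / n, d / n)

def obfuscation_error(subset):
    """Trace norm of (fake expected operator - true expected operator)."""
    fa, fb, fd = expected_operator(subset)
    ta, tb, td = expected_operator(list(STATES))
    a, b, d = fa - ta, fb - tb, fd - td
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    # Eigenvalues of a symmetric 2x2 matrix are mean +/- radius.
    return abs(mean + radius) + abs(mean - radius)

print(obfuscation_error(["0", "1"]))  # fake ensemble used for private bit 0
print(obfuscation_error(["+", "-"]))  # fake ensemble used for private bit 1
print(obfuscation_error(["0", "+"]))  # a poor choice of fake ensemble
```

The two fake ensembles from the example have zero obfuscation error up to rounding, while the subset $\{|0\rangle, |+\rangle\}$ does not average to $I/2$ and therefore leaks information.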

The goal for Alice is to make the size of her fake ensembles as small as possible while 
still having privacy from Eve. The covering lemma makes this tradeoff exact by determining 
exactly how small each fake ensemble can be in order to obtain a certain obfuscation error. 

The hypotheses of the Covering Lemma are somewhat similar to those of the Packing 
Lemma. But as stated in the introduction of this chapter, the goal of the Covering Lemma 
is much different. 

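Lemma 16.2.1 (Covering Lemma). Suppose we are given an ensemble as defined in
Definition 16.2.1. Suppose a total subspace projector $\Pi$ and codeword subspace projectors $\{\Pi_x\}_{x \in \mathcal{X}}$
exist, they project onto subspaces of the Hilbert space in which the states $\{\sigma_x\}$ exist, and
these projectors and the ensemble satisfy the following conditions:

\begin{align}
\mathrm{Tr}\{\sigma_x \Pi\} &\geq 1 - \epsilon, \tag{16.13}\\
\mathrm{Tr}\{\sigma_x \Pi_x\} &\geq 1 - \epsilon, \tag{16.14}\\
\mathrm{Tr}\{\Pi\} &\leq D, \tag{16.15}\\
\Pi_x \sigma_x \Pi_x &\leq \frac{1}{d}\Pi_x. \tag{16.16}
\end{align}

Suppose that $\mathcal{M}$ is a set of size $|\mathcal{M}|$ with elements $m$. Let a random covering code
$\mathcal{C} \equiv \{C_m\}_{m \in \mathcal{M}}$ consist of random codewords $C_m$ where the codewords $C_m$ are chosen according
to the distribution $p_X(x)$ and give rise to a fake ensemble $\{1/|\mathcal{M}|, \sigma_{C_m}\}_{m \in \mathcal{M}}$. Then there is a
high probability that the obfuscation error $o_e(\mathcal{C})$ of the random covering code $\mathcal{C}$ is small:

$$\Pr\left\{ o_e(\mathcal{C}) \leq \epsilon + 4\sqrt{\epsilon} + 24\sqrt[4]{\epsilon} \right\} \geq 1 - 2D \exp\left( -\frac{\epsilon^3}{4 \ln 2} \frac{|\mathcal{M}| d}{D} \right), \tag{16.17}$$

when $|\mathcal{M}| \gg D/d$, so that the exponent is large. Thus it is highly likely that the expected density operator of the fake
ensemble $\{1/|\mathcal{M}|, \sigma_{C_m}\}_{m \in \mathcal{M}}$ is indistinguishable from the expected density operator of the original
ensemble $\{p_X(x), \sigma_x\}_{x \in \mathcal{X}}$. It is in this sense that the fake ensemble $\{1/|\mathcal{M}|, \sigma_{C_m}\}_{m \in \mathcal{M}}$ "covers"
the original ensemble $\{p_X(x), \sigma_x\}_{x \in \mathcal{X}}$.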
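To make the statement concrete, here is a small simulation sketch of a random covering code. It assumes a toy ensemble that is not from the text: `K` real qubit states $\cos\theta_k|0\rangle + \sin\theta_k|1\rangle$ with $\theta_k = k\pi/K$ and uniform $p_X$, whose expected density operator is $I/2$. Codewords are drawn independently from $p_X$ and the obfuscation error is the trace norm of the difference between the fake and true expected operators. All function names are hypothetical.

```python
import math
import random

K = 8  # alphabet size of the toy ensemble (an assumption of this sketch)

def projector(k):
    """Projector onto cos(t)|0> + sin(t)|1>, stored as (a, b, d) for [[a, b], [b, d]]."""
    t = k * math.pi / K
    c, s = math.cos(t), math.sin(t)
    return (c * c, c * s, s * s)

def average(indices):
    """Uniform average of the projectors selected by `indices`."""
    a = b = d = 0.0
    for k in indices:
        pa, pb, pd = projector(k)
        a += pa
        b += pb
        d += pd
    n = len(indices)
    return (a / n, b / n, d / n)

def trace_norm(m):
    """Trace norm of a real symmetric 2x2 matrix given as (a, b, d)."""
    a, b, d = m
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    return abs(mean + radius) + abs(mean - radius)

def obfuscation_error(code):
    """|| fake expected operator of `code` - true expected operator ||_1."""
    ta, tb, td = average(range(K))
    ca, cb, cd = average(code)
    return trace_norm((ca - ta, cb - tb, cd - td))

rng = random.Random(0)
for size in (2, 16, 128, 1024):
    code = [rng.randrange(K) for _ in range(size)]  # codewords drawn from p_X
    print(size, obfuscation_error(code))
```

In runs of this kind the error tends to shrink as the code size grows, which is the qualitative behavior the lemma quantifies; the toy ensemble is too small to exhibit the $D/d$ scaling itself.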
16.3 Proof of the Covering Lemma

Before giving the proof of the Covering Lemma, we first state the Operator Chernoff Bound 
that is useful in proving the Covering Lemma. The Operator Chernoff Bound is a theorem 
from the theory of large deviations and essentially states that the sample average of a large 
number of IID random operators is close to the expectation of these random operators 
(with some constraints on the random operators). The full proof of this lemma appears in
Appendix A.

Lemma 16.3.1 (Operator Chernoff Bound). Let $\xi_1, \ldots, \xi_M$ be $M$ independent and identically
distributed random variables with values in the algebra $\mathcal{B}(\mathcal{H})$ of bounded linear operators
on some Hilbert space $\mathcal{H}$. Each $\xi_m$ has all of its eigenvalues between the null operator and
the identity operator $I$:

$$\forall m \in [M] : 0 \leq \xi_m \leq I. \tag{16.18}$$

Let $\overline{\xi}$ denote the sample average of the $M$ random variables:

$$\overline{\xi} = \frac{1}{M} \sum_{m=1}^{M} \xi_m. \tag{16.19}$$

Suppose that the expectation $\mathbb{E}_{\xi}\{\xi_m\} \equiv \mu$ of each operator $\xi_m$ exceeds the identity operator
scaled by a number $a \in (0, 1)$:

$$\mu \geq aI. \tag{16.20}$$

Then for every $\eta$ where $0 < \eta < 1/2$ and $(1 + \eta)a \leq 1$, we can bound the probability that the
sample average $\overline{\xi}$ lies inside the operator interval $[(1 \pm \eta)\mu]$:

$$\Pr\left\{ (1 - \eta)\mu \leq \overline{\xi} \leq (1 + \eta)\mu \right\} \geq 1 - 2 \dim\mathcal{H} \, \exp\left( -\frac{M \eta^2 a}{4 \ln 2} \right). \tag{16.21}$$

Thus it is highly likely that the sample average operator $\overline{\xi}$ becomes close to the true expected
operator $\mu$ as $M$ becomes large.
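A quick Monte Carlo sketch can illustrate the lemma. The setup below is an assumption made for illustration: each $\xi_m$ is a projector onto a random real qubit state $\cos t|0\rangle + \sin t|1\rangle$ with $t$ uniform on $[0, \pi)$, so that $\mu = I/2$ and $a = 1/2$. Since such a projector equals $\frac{1}{2}(I + \cos 2t\, Z + \sin 2t\, X)$, the sample average has eigenvalues $\frac{1}{2}(1 \pm r)$, where $r$ is the length of the averaged vector, and the operator interval condition reduces to $r \leq \eta$.

```python
import math
import random

def sample_average_radius(M, rng):
    """Length r of the averaged Bloch-type vector of M random real qubit projectors."""
    x = y = 0.0
    for _ in range(M):
        t = rng.uniform(0.0, math.pi)
        x += math.cos(2 * t)
        y += math.sin(2 * t)
    return math.hypot(x, y) / M

def chernoff_lower_bound(M, eta, a, dim):
    """Right-hand side of the operator Chernoff bound (16.21)."""
    return 1 - 2 * dim * math.exp(-M * eta ** 2 * a / (4 * math.log(2)))

M, eta, a, dim = 2000, 0.2, 0.5, 2   # eta < 1/2 and (1 + eta) * a <= 1 hold
rng = random.Random(1)
trials = 50
# The event (1 - eta) mu <= average <= (1 + eta) mu is equivalent to r <= eta here.
inside = sum(sample_average_radius(M, rng) <= eta for _ in range(trials))
empirical = inside / trials
bound = chernoff_lower_bound(M, eta, a, dim)
print("empirical frequency:", empirical)
print("Chernoff lower bound:", bound)
```

The empirical frequency is only an estimate from a finite number of trials, so it should be read as a sanity check on the bound rather than a verification of it.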

The first step of the proof of the Covering Lemma is to construct an alternate ensemble
that is close to the original ensemble yet satisfies the conditions of the Operator Chernoff
Bound (Lemma 16.3.1). We call this alternate ensemble the "Chernoff ensemble." We then
generate a random code, a set of $M$ IID random variables, using the Chernoff ensemble.
Call this random code the "Chernoff code." We apply the Operator Chernoff Bound to the
Chernoff code to obtain a good bound on the obfuscation error of the Chernoff code. We
finally show that the bound holds for a covering code generated by the original ensemble
because the original ensemble is close to the Chernoff ensemble in trace distance.

16.3.1 Construction of the Chernoff Ensemble 

We first establish a few definitions to construct intermediary ensembles. We then use these
intermediary ensembles to construct the Chernoff ensemble. We construct the first "primed"
ensemble $\{p_X(x), \sigma'_x\}$ by using the projection operators $\Pi_x$ to slice out some of the support
of the states $\sigma_x$:

$$\forall x \quad \sigma'_x \equiv \Pi_x \sigma_x \Pi_x. \tag{16.22}$$

The above "slicing" operation cuts out any elements of the support of $\sigma_x$ that are not in
the support of $\Pi_x$. The expected operator $\sigma'$ for the first primed ensemble is as follows:

$$\sigma' \equiv \sum_{x \in \mathcal{X}} p_X(x)\sigma'_x. \tag{16.23}$$

We then continue slicing with the projector $\Pi$ and form the second primed ensemble $\{p_X(x), \sigma''_x\}$
as follows:

$$\forall x \quad \sigma''_x \equiv \Pi \sigma'_x \Pi. \tag{16.24}$$

The expected operator for the second primed ensemble is as follows:

$$\sigma'' \equiv \sum_{x \in \mathcal{X}} p_X(x)\sigma''_x. \tag{16.25}$$

Let $\hat{\Pi}$ be the projector onto the subspace spanned by the eigenvectors of $\sigma''$ whose
corresponding eigenvalues are greater than $\epsilon/D$. We would expect that this extra slicing does
not change the state very much when $D$ is large. We construct states $\omega_x$ in the Chernoff
ensemble by using the projector $\hat{\Pi}$ to slice out some more elements of the support of the
original ensemble:

$$\forall x \quad \omega_x \equiv \hat{\Pi} \sigma''_x \hat{\Pi}. \tag{16.26}$$

The expected operator $\omega$ for the Chernoff ensemble is then as follows:

$$\omega \equiv \sum_{x \in \mathcal{X}} p_X(x)\omega_x. \tag{16.27}$$

The Chernoff ensemble satisfies the conditions necessary to apply the Chernoff bound. We
wait to apply the Chernoff bound and for now show how to construct a random covering
code.

16.3.2 Chernoff Code Construction 

We present a Shannon-like random coding argument. We construct a covering code $\mathcal{C}$ at
random by independently generating $|\mathcal{M}|$ codewords according to the distribution $p_X(x)$.
Let $\mathcal{C} \equiv \{c_m\}_{m \in \mathcal{M}}$ be a collection of the realizations $c_m$ of $|\mathcal{M}|$ independent random variables
$C_m$. Each $C_m$ takes a value $c_m$ in $\mathcal{X}$ with probability $p_X(c_m)$ and represents a codeword in
the random code $\mathcal{C}$. This process generates the Chernoff code $\mathcal{C}$ consisting of $|\mathcal{M}|$ quantum
states $\{\omega_{c_m}\}_{m \in \mathcal{M}}$. The fake expected operator $\overline{\omega}(\mathcal{C})$ of the states in the Chernoff code is as
follows:

$$\overline{\omega}(\mathcal{C}) \equiv \frac{1}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} \omega_{c_m}, \tag{16.28}$$

because we assume that Alice randomizes codewords in the Chernoff code according to a
uniform distribution (notice that there is a difference in the distribution that we use to
choose the code and the distribution that Alice uses to randomize the codewords). The
expectation $\mathbb{E}_{\mathcal{C}}\{\omega_{C_m}\}$ of each operator $\omega_{C_m}$ is equal to the expected operator $\omega$ because of
the way that we constructed the covering code. We can also define codes with respect to the
primed ensembles as follows: $\{\sigma_{c_m}\}_{m \in \mathcal{M}}$, $\{\sigma'_{c_m}\}_{m \in \mathcal{M}}$, $\{\sigma''_{c_m}\}_{m \in \mathcal{M}}$. These codes respectively
have fake expected operators of the following form:

\begin{align}
\overline{\sigma}(\mathcal{C}) &\equiv \frac{1}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} \sigma_{c_m}, \tag{16.29}\\
\overline{\sigma}'(\mathcal{C}) &\equiv \frac{1}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} \sigma'_{c_m}, \tag{16.30}\\
\overline{\sigma}''(\mathcal{C}) &\equiv \frac{1}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} \sigma''_{c_m}. \tag{16.31}
\end{align}

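Applying the Chernoff Bound: We make one final modification before applying the
Operator Chernoff Bound. The operators $\omega_{c_m}$ are in the operator interval between the null
operator and $\frac{1}{d}\hat{\Pi}$:

$$\forall m \in \mathcal{M} : 0 \leq \omega_{c_m} \leq \frac{1}{d}\hat{\Pi}. \tag{16.32}$$

The above statement holds because the operators $\sigma'_x$ satisfy $\sigma'_x = \Pi_x \sigma_x \Pi_x \leq \frac{1}{d}\Pi_x$ (the fourth
condition of the Covering Lemma) and this condition implies the following inequalities:

\begin{align}
\sigma'_x = \Pi_x \sigma_x \Pi_x &\leq \frac{1}{d}\Pi_x \tag{16.33}\\
\Rightarrow \Pi \sigma'_x \Pi = \sigma''_x &\leq \frac{1}{d}\Pi \Pi_x \Pi \leq \frac{1}{d}\Pi \tag{16.34}\\
\Rightarrow \omega_x = \hat{\Pi} \sigma''_x \hat{\Pi} &\leq \frac{1}{d}\hat{\Pi} \Pi \hat{\Pi} \leq \frac{1}{d}\hat{\Pi}. \tag{16.35}
\end{align}

Therefore, we consider another set of operators (not necessarily density operators) where we
scale each $\omega_{c_m}$ by $d$ so that

$$\forall m \in \mathcal{M} : 0 \leq d\,\omega_{c_m} \leq \hat{\Pi}. \tag{16.36}$$

This code satisfies the conditions of the Operator Chernoff Bound with $a = \epsilon d/D$ and with
$\hat{\Pi}$ acting as the identity on the subspace onto which it projects. We can now apply the
Operator Chernoff Bound to bound the probability that the sample average $\overline{\omega}$ falls in the
operator interval $[(1 \pm \epsilon)\omega]$:

\begin{align}
\Pr\{(1 - \epsilon)\omega \leq \overline{\omega} \leq (1 + \epsilon)\omega\} &= \Pr\{d(1 - \epsilon)\omega \leq d\,\overline{\omega} \leq d(1 + \epsilon)\omega\} \tag{16.37}\\
&\geq 1 - 2\,\mathrm{Tr}\{\hat{\Pi}\} \exp\left( -\frac{|\mathcal{M}| \epsilon^2 (\epsilon d/D)}{4 \ln 2} \right) \tag{16.38}\\
&\geq 1 - 2D \exp\left( -\frac{\epsilon^3}{4 \ln 2} \frac{|\mathcal{M}| d}{D} \right). \tag{16.39}
\end{align}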









16.3.3 Obfuscation Error of the Covering Code

The random covering code is a set of $|\mathcal{M}|$ quantum states $\{\sigma_{C_m}\}_{m \in \mathcal{M}}$ where the quantum
states arise from the original ensemble. Recall that our goal is to show that the obfuscation
error of the random covering code $\mathcal{C}$,

$$o_e(\mathcal{C}) = \|\overline{\sigma}(\mathcal{C}) - \sigma\|_1, \tag{16.40}$$

has a high probability of being small.

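We now show that the obfuscation error of this random covering code is highly likely to
be small, by relating it to the Chernoff ensemble. Our method of proof is simply to exploit
the triangle inequality, the Gentle Operator Lemma (Lemma 9.4.2), and (9.58) several times.

The triangle inequality gives the following bound for the obfuscation error:

\begin{align}
o_e(\mathcal{C}) &= \|\overline{\sigma}(\mathcal{C}) - \sigma\|_1 \tag{16.41}\\
&= \|\overline{\sigma}(\mathcal{C}) - \overline{\sigma}''(\mathcal{C}) - (\overline{\omega}(\mathcal{C}) - \overline{\sigma}''(\mathcal{C})) + (\overline{\omega}(\mathcal{C}) - \omega) + (\omega - \sigma'') - (\sigma - \sigma'')\|_1 \tag{16.42}\\
&\leq \|\overline{\sigma}(\mathcal{C}) - \overline{\sigma}''(\mathcal{C})\|_1 + \|\overline{\omega}(\mathcal{C}) - \overline{\sigma}''(\mathcal{C})\|_1 + \|\overline{\omega}(\mathcal{C}) - \omega\|_1 + \|\omega - \sigma''\|_1 + \|\sigma - \sigma''\|_1. \tag{16.43}
\end{align}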

We show how to obtain a good bound for each of the above five terms. 

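First consider the rightmost term $\|\sigma - \sigma''\|_1$. Consider that the projected state $\sigma'_x =
\Pi_x \sigma_x \Pi_x$ is close to the original state $\sigma_x$ by applying (16.14) and the Gentle Operator Lemma:

$$\|\sigma_x - \sigma'_x\|_1 \leq 2\sqrt{\epsilon}. \tag{16.44}$$

Consider that

$$\|\sigma'_x - \sigma''_x\|_1 \leq 2\sqrt{\epsilon + 2\sqrt{\epsilon}} \tag{16.45}$$

because $\sigma''_x = \Pi \sigma'_x \Pi$ and from applying the Gentle Operator Lemma to

\begin{align}
\mathrm{Tr}\{\Pi \sigma'_x\} &\geq \mathrm{Tr}\{\Pi \sigma_x\} - \|\sigma_x - \sigma'_x\|_1 \tag{16.46}\\
&\geq 1 - \epsilon - 2\sqrt{\epsilon}, \tag{16.47}
\end{align}

where the first inequality follows from Exercise 9.1.7 and the second from (16.13) and (16.44).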



Then the state $\sigma''_x$ is close to the original state $\sigma_x$ for all $x$ because

\begin{align}
\|\sigma_x - \sigma''_x\|_1 &\leq \|\sigma_x - \sigma'_x\|_1 + \|\sigma'_x - \sigma''_x\|_1 \tag{16.48}\\
&\leq 2\sqrt{\epsilon} + 2\sqrt{\epsilon + 2\sqrt{\epsilon}}, \tag{16.49}
\end{align}

where we first applied the triangle inequality and the bounds from (16.44) and (16.45).
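Convexity of the trace distance then gives a bound on $\|\sigma - \sigma''\|_1$:

\begin{align}
\|\sigma - \sigma''\|_1 &= \Big\| \sum_{x \in \mathcal{X}} p_X(x)\sigma_x - \sum_{x \in \mathcal{X}} p_X(x)\sigma''_x \Big\|_1 \tag{16.50}\\
&= \Big\| \sum_{x \in \mathcal{X}} p_X(x)(\sigma_x - \sigma''_x) \Big\|_1 \tag{16.51}\\
&\leq \sum_{x \in \mathcal{X}} p_X(x)\|\sigma_x - \sigma''_x\|_1 \tag{16.52}\\
&\leq \sum_{x \in \mathcal{X}} p_X(x)\left( 2\sqrt{\epsilon} + 2\sqrt{\epsilon + 2\sqrt{\epsilon}} \right) \tag{16.53}\\
&= 2\sqrt{\epsilon} + 2\sqrt{\epsilon + 2\sqrt{\epsilon}}. \tag{16.54}
\end{align}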



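We now consider the second rightmost term $\|\omega - \sigma''\|_1$. The support of $\sigma''$ has dimension
less than $D$ by (16.15), the third condition in the Covering Lemma. Therefore, eigenvalues
smaller than $\epsilon/D$ contribute at most $\epsilon$ to $\mathrm{Tr}\{\sigma''\}$. We can bound the trace of $\omega$ as follows:

\begin{align}
\mathrm{Tr}\{\omega\} &\geq (1 - \epsilon)\mathrm{Tr}\{\sigma''\} \tag{16.55}\\
&= (1 - \epsilon)\mathrm{Tr}\Big\{ \sum_{x \in \mathcal{X}} p_X(x)\sigma''_x \Big\} \tag{16.56}\\
&= \sum_{x \in \mathcal{X}} p_X(x)(1 - \epsilon)\mathrm{Tr}\{\sigma''_x\} \tag{16.57}\\
&\geq \Big( \sum_{x \in \mathcal{X}} p_X(x) \Big)(1 - \epsilon)(1 - \epsilon - 2\sqrt{\epsilon}) \tag{16.58}\\
&= (1 - \epsilon)(1 - \epsilon - 2\sqrt{\epsilon}) \tag{16.59}\\
&\geq 1 - 2(\epsilon + \sqrt{\epsilon}), \tag{16.60}
\end{align}

where the first inequality applies the above "eigenvalue bounding" argument and the second
inequality employs the bound in (16.46)-(16.47). This argument shows that the average operator of the
Chernoff ensemble almost has trace one. We can then apply the Gentle Operator Lemma to
$\mathrm{Tr}\{\omega\} \geq 1 - 2(\epsilon + \sqrt{\epsilon})$ to give

$$\|\omega - \sigma''\|_1 \leq 2\sqrt{2(\epsilon + \sqrt{\epsilon})}. \tag{16.61}$$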



We now consider the middle term $\|\overline{\omega}(\mathcal{C}) - \omega\|_1$. The Chernoff bound gives us a probabilistic
estimate and not a deterministic estimate like the other two bounds we have shown above.
So we suppose for now that the fake operator $\overline{\omega}(\mathcal{C})$ of the Chernoff code is close to the average
operator $\omega$ of the Chernoff ensemble:

$$(1 - \epsilon)\omega \leq \overline{\omega}(\mathcal{C}) \leq (1 + \epsilon)\omega. \tag{16.62}$$

With this assumption, it holds that

$$\|\overline{\omega}(\mathcal{C}) - \omega\|_1 \leq \epsilon, \tag{16.63}$$

by employing Lemma A.0.2 from Appendix A and $\mathrm{Tr}\{\omega\} \leq 1$.

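We consider the second leftmost term $\|\overline{\omega}(\mathcal{C}) - \overline{\sigma}''(\mathcal{C})\|_1$. The following inequality holds

$$\mathrm{Tr}\{\overline{\omega}(\mathcal{C})\} \geq 1 - 3\epsilon - 2\sqrt{\epsilon}, \tag{16.64}$$

because in (16.60) we showed that

$$\mathrm{Tr}\{\omega\} \geq 1 - 2(\epsilon + \sqrt{\epsilon}), \tag{16.65}$$

and we use the triangle inequality:

\begin{align}
\mathrm{Tr}\{\overline{\omega}(\mathcal{C})\} &= \|\overline{\omega}(\mathcal{C})\|_1 \tag{16.66}\\
&= \|\omega - (\omega - \overline{\omega}(\mathcal{C}))\|_1 \tag{16.67}\\
&\geq \|\omega\|_1 - \|\omega - \overline{\omega}(\mathcal{C})\|_1 \tag{16.68}\\
&= \mathrm{Tr}\{\omega\} - \|\omega - \overline{\omega}(\mathcal{C})\|_1 \tag{16.69}\\
&\geq (1 - 2(\epsilon + \sqrt{\epsilon})) - \epsilon \tag{16.70}\\
&= 1 - 3\epsilon - 2\sqrt{\epsilon}. \tag{16.71}
\end{align}

Apply the Gentle Operator Lemma to $\mathrm{Tr}\{\overline{\omega}(\mathcal{C})\} \geq 1 - 3\epsilon - 2\sqrt{\epsilon}$ to give

$$\|\overline{\omega}(\mathcal{C}) - \overline{\sigma}''(\mathcal{C})\|_1 \leq 2\sqrt{3\epsilon + 2\sqrt{\epsilon}}. \tag{16.72}$$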


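We finally bound the leftmost term $\|\overline{\sigma}(\mathcal{C}) - \overline{\sigma}''(\mathcal{C})\|_1$. We can use convexity of trace distance
and (16.49) to obtain the following bounds:

\begin{align}
\|\overline{\sigma}(\mathcal{C}) - \overline{\sigma}''(\mathcal{C})\|_1 &\leq \frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}} \|\sigma_{C_m} - \sigma''_{C_m}\|_1 \tag{16.73}\\
&\leq 2\sqrt{\epsilon} + 2\sqrt{\epsilon + 2\sqrt{\epsilon}}. \tag{16.74}
\end{align}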


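We now combine all of the above bounds with the triangle inequality in order to bound
the obfuscation error of the covering code $\mathcal{C}$:

\begin{align}
o_e(\mathcal{C}) &= \|\overline{\sigma}(\mathcal{C}) - \sigma\|_1 \tag{16.75}\\
&= \|\overline{\sigma}(\mathcal{C}) - \overline{\sigma}''(\mathcal{C}) - (\overline{\omega}(\mathcal{C}) - \overline{\sigma}''(\mathcal{C})) + (\overline{\omega}(\mathcal{C}) - \omega) + (\omega - \sigma'') - (\sigma - \sigma'')\|_1 \tag{16.76}\\
&\leq \|\overline{\sigma}(\mathcal{C}) - \overline{\sigma}''(\mathcal{C})\|_1 + \|\overline{\omega}(\mathcal{C}) - \overline{\sigma}''(\mathcal{C})\|_1 + \|\overline{\omega}(\mathcal{C}) - \omega\|_1 + \|\omega - \sigma''\|_1 + \|\sigma - \sigma''\|_1 \tag{16.77}\\
&\leq \left( 2\sqrt{\epsilon} + 2\sqrt{\epsilon + 2\sqrt{\epsilon}} \right) + 2\sqrt{3\epsilon + 2\sqrt{\epsilon}} + \epsilon + 2\sqrt{2(\epsilon + \sqrt{\epsilon})} + \left( 2\sqrt{\epsilon} + 2\sqrt{\epsilon + 2\sqrt{\epsilon}} \right) \tag{16.78}\\
&= \epsilon + 4\sqrt{\epsilon} + 4\sqrt{\epsilon + 2\sqrt{\epsilon}} + 2\sqrt{3\epsilon + 2\sqrt{\epsilon}} + 2\sqrt{2(\epsilon + \sqrt{\epsilon})} \tag{16.79}\\
&\leq \epsilon + 4\sqrt{\epsilon} + 24\sqrt[4]{\epsilon}. \tag{16.80}
\end{align}

Observe from the above that the event in which the quantity $\epsilon$ bounds the obfuscation error $o_e(\mathcal{C})$
of the Chernoff code with states $\omega_{C_m}$ implies the event in which the quantity $\epsilon + 4\sqrt{\epsilon} + 24\sqrt[4]{\epsilon}$
bounds the obfuscation error $o_e(\mathcal{C})$ of the original code with states $\sigma_{C_m}$. Thus, we can bound
the probability of obfuscation error of the covering code by applying the Chernoff bound:

\begin{align}
\Pr\left\{ o_e(\mathcal{C}, \{\sigma_{C_m}\}) \leq \epsilon + 4\sqrt{\epsilon} + 24\sqrt[4]{\epsilon} \right\} &\geq \Pr\{ o_e(\mathcal{C}, \{\omega_{C_m}\}) \leq \epsilon \} \tag{16.81}\\
&\geq 1 - 2D \exp\left( -\frac{\epsilon^3}{4 \ln 2} \frac{|\mathcal{M}| d}{D} \right). \tag{16.82}
\end{align}

This argument shows that it is highly likely that a random covering code is good in the sense
that it has a low obfuscation error.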
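The last simplification step, from (16.79) to (16.80), and the size of code needed for the probability bound in (16.17) to be meaningful can both be explored numerically. The sketch below is illustrative only: the function names and the sample parameters ($\epsilon = 0.1$, $D = 2^{20}$, $d = 2^{10}$) are assumptions, not values from the text.

```python
import math

def exact_sum(eps):
    """Sum of the five bounds, as in (16.79)."""
    r = math.sqrt(eps)
    return (eps + 4 * r + 4 * math.sqrt(eps + 2 * r)
            + 2 * math.sqrt(3 * eps + 2 * r) + 2 * math.sqrt(2 * (eps + r)))

def simple_bound(eps):
    """Simplified bound, as in (16.80)."""
    return eps + 4 * math.sqrt(eps) + 24 * eps ** 0.25

def success_probability_bound(eps, code_size, d, D):
    """Lower bound on the success probability from the Covering Lemma (16.17)."""
    return 1 - 2 * D * math.exp(-eps ** 3 * code_size * d / (4 * math.log(2) * D))

# Check (16.79) <= (16.80) on a grid of eps values in (0, 1].
grid = [k / 1000 for k in range(1, 1001)]
print(all(exact_sum(e) <= simple_bound(e) for e in grid))

# The probability bound is vacuous (negative) for small codes and becomes
# useful only once the code size is large compared with D / (eps^3 d).
eps, d, D = 0.1, 2 ** 10, 2 ** 20
for code_size in (2 ** 10, 2 ** 20, 2 ** 27):
    print(code_size, success_probability_bound(eps, code_size, d, D))
```

With these sample values, $D/d = 2^{10}$, and the bound only turns positive for code sizes several orders of magnitude above that because of the $\epsilon^3$ factor in the exponent.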

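Exercise 16.3.1 Prove that the Covering Lemma holds for the same ensemble and a set of
projectors for which the following conditions hold:

\begin{align}
\sum_{x \in \mathcal{X}} p_X(x)\mathrm{Tr}\{\sigma_x \Pi\} &\geq 1 - \epsilon, \tag{16.83}\\
\sum_{x \in \mathcal{X}} p_X(x)\mathrm{Tr}\{\sigma_x \Pi_x\} &\geq 1 - \epsilon, \tag{16.84}\\
\mathrm{Tr}\{\Pi\} &\leq D, \tag{16.85}\\
\Pi_x \sigma_x \Pi_x &\leq \frac{1}{d}\Pi_x. \tag{16.86}
\end{align}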

Exercise 16.3.2 Show that there exists a particular covering code with the property that 
the obfuscation error is small. 
16.4 History and Further Reading

Ahlswede and Winter introduced the operator Chernoff bound in the context of quantum
identification [7]. Winter et al. later applied it to quantum measurement compression
[259, 260]. Devetak and Winter applied the Covering Lemma to classical compression
with quantum side information [74] and to distilling secret key from quantum states [76].
Devetak [68] and Cai et al. [51] applied it to private classical communication over a quantum
channel, and Groisman et al. applied it to study the destruction of correlations in a bipartite
state [117].
CHAPTER 17



Schumacher Compression 



One of the fundamental tasks in classical information theory is the compression of
information. Given access to many uses of a noiseless classical channel, what is the best that a
sender and receiver can make of this resource for compressed data transmission? Shannon's
compression theorem demonstrates that the Shannon entropy is the fundamental limit for
the compression rate in the IID setting (recall the development in Section 13.4). That is,
if one compresses at a rate above the Shannon entropy, then it is possible to recover the
compressed data perfectly in the asymptotic limit, and otherwise, it is not possible to do
so.¹ This theorem establishes the prominent role of the entropy in Shannon's theory of
information.

In the quantum world, it very well could be that one day a sender and a receiver would
have many uses of a noiseless quantum channel available,² and the sender could use this
resource to transmit compressed quantum information. But what exactly does this mean
in the quantum setting? A simple model of a quantum information source is an ensemble
of quantum states $\{p_X(x), |\psi_x\rangle\}$, i.e., the source outputs the state $|\psi_x\rangle$ with probability
$p_X(x)$, and the states $\{|\psi_x\rangle\}$ do not necessarily have to form an orthonormal basis. Let
us suppose for the moment that the classical data $x$ is available as well, even though this
might not necessarily be the case in practice. A naive strategy for compressing this quantum
information source would be to ignore the quantum states coming out, handle the classical
data instead, and exploit Shannon's compression protocol from Section 13.4. That is, the
sender compresses the sequence $x^n$ emitted from the quantum information source at a rate
equal to the Shannon entropy $H(X)$, sends the compressed classical bits over the noiseless
quantum channels, the receiver reproduces the classical sequence $x^n$ at his end, and finally
reconstructs the sequence $|\psi_{x^n}\rangle$ of quantum states corresponding to the classical sequence $x^n$.
The above strategy will certainly work, but it makes no use of the fact that the noiseless
quantum channels are quantum! It is clear that noiseless quantum channels will be expensive
in practice, and the above strategy is wasteful in this sense because it could have merely
exploited classical channels (channels that cannot preserve superpositions) to achieve the
same goals. Schumacher compression is a strategy that makes effective use of noiseless
quantum channels to compress a quantum information source down to a rate equal to the
von Neumann entropy. This has a great benefit from a practical standpoint: recall from
Exercise 11.9.2 that the von Neumann entropy of a quantum information source is strictly
lower than the source's Shannon entropy if the states in the ensemble are non-orthogonal.
In order to execute the protocol, the sender and receiver simply need to know the density
operator $\rho \equiv \sum_x p_X(x)|\psi_x\rangle\langle\psi_x|$ of the source. Furthermore, Schumacher compression is
provably optimal in the sense that any protocol that compresses a quantum information
source of the above form at a rate below the von Neumann entropy cannot have a vanishing
error in the asymptotic limit.

¹ Technically, we did not prove the converse part of Shannon's data compression theorem, but the converse
of this chapter suffices for Shannon's classical theorem as well.

² How we hope so! If working, coherent fault-tolerant quantum computers come along one day, they stand
to benefit from quantum compression protocols.
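The gap between the two entropies is easy to see on a small example. The sketch below assumes a hypothetical source, not one taken from the text, that emits $|0\rangle$ or $|+\rangle$ with probability 1/2 each. Its Shannon entropy is one bit, while the von Neumann entropy of $\rho = \frac{1}{2}|0\rangle\langle 0| + \frac{1}{2}|+\rangle\langle +|$ is noticeably smaller because the two states are non-orthogonal.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Source: |0> and |+> with probability 1/2 each (an illustrative assumption).
p_X = [0.5, 0.5]
shannon = entropy_bits(p_X)

# rho = 1/2 |0><0| + 1/2 |+><+| = [[a, b], [b, d]] with real entries.
a = 0.5 * 1.0 + 0.5 * 0.5
b = 0.5 * 0.5
d = 0.5 * 0.5

# Eigenvalues of a real symmetric 2x2 matrix.
mean = (a + d) / 2
radius = math.hypot((a - d) / 2, b)
eigenvalues = [mean + radius, mean - radius]
von_neumann = entropy_bits(eigenvalues)

print("Shannon entropy H(X):      ", shannon)
print("von Neumann entropy H(rho):", von_neumann)
```

For this source the von Neumann entropy is roughly 0.6 qubits per source symbol, so compressing the quantum states directly would use substantially fewer channel uses than the naive classical strategy at one bit per symbol.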

Schumacher compression thus gives an operational interpretation of the von Neumann 
entropy as the fundamental limit on the rate of quantum data compression. Also, it sets 
the term "qubit" on a firm foundation in an information-theoretic sense as a measure of the 
amount of quantum information "contained" in a quantum information source. 

We begin this chapter by giving the details of the general information processing task 
corresponding to quantum data compression. We then prove that the von Neumann entropy 
is an achievable rate of compression and follow by showing that it is optimal (these two 
respective parts are the direct coding theorem and converse theorem for quantum data 
compression). We illustrate how much savings one can gain in quantum data compression 
by detailing a specific example. The final section of the chapter closes with a presentation 
of more general forms of Schumacher compression. 

17.1 The Information Processing Task 

We first overview the general task that any quantum compression protocol attempts to
accomplish. Three parameters $n$, $R$, and $\epsilon$ corresponding to the length of the original
quantum data sequence, the rate, and the error, respectively, characterize any such protocol.
An $(n, R + \delta, \epsilon)$ quantum compression code consists of four steps: state preparation, encoding,
transmission, and decoding. Figure 17.1 depicts a general protocol for quantum compression.

State Preparation. The quantum information source outputs a sequence $|\psi_{x^n}\rangle^{A^n}$ of
quantum states according to the ensemble $\{p_X(x), |\psi_x\rangle\}$ where

$$|\psi_{x^n}\rangle^{A^n} \equiv |\psi_{x_1}\rangle^{A_1} \otimes \cdots \otimes |\psi_{x_n}\rangle^{A_n}. \tag{17.1}$$

The density operator, from the perspective of someone ignorant of the classical sequence $x^n$,
is equal to the tensor power state $\rho^{\otimes n}$ where

$$\rho \equiv \sum_x p_X(x)|\psi_x\rangle\langle\psi_x|. \tag{17.2}$$

Also, we can think about the purification of the above density operator. That is, an equivalent
mathematical picture is to imagine that the quantum information source produces states
of the form

$$|\varphi_\rho\rangle^{RA} \equiv \sum_x \sqrt{p_X(x)}\, |x\rangle^{R} |\psi_x\rangle^{A}, \tag{17.3}$$

where $R$ is the label for an inaccessible reference system (not to be confused with the rate
$R$). The resulting IID state produced is $(|\varphi_\rho\rangle^{RA})^{\otimes n}$.

Figure 17.1: The most general protocol for quantum compression. Alice begins with the output of some
quantum information source whose density operator is $\rho^{\otimes n}$ on some system $A^n$. The inaccessible reference
system holds the purification of this density operator. She performs some CPTP encoding map $\mathcal{E}$, sends
the compressed qubits through approximately $nR$ uses of a noiseless quantum channel, and Bob performs some CPTP
decoding map $\mathcal{D}$ to decompress the qubits. The scheme is successful if the difference between the initial
state and the final state is negligible in the asymptotic limit $n \to \infty$.

Encoding. Alice encodes the systems $A^n$ according to some CPTP compression map
$\mathcal{E}^{A^n \to W}$ where $W$ is a quantum system of size $2^{nR}$. Recall that $R$ is the rate of compression:

$$R = \frac{1}{n} \log d_W - \delta, \tag{17.4}$$

where $d_W$ is the dimension of system $W$ and $\delta$ is an arbitrarily small positive number.

Transmission. Alice transmits the system $W$ to Bob using $n(R + \delta)$ noiseless qubit
channels.

Decoding. Bob sends the system $W$ through a decompression map $\mathcal{D}^{W \to \hat{A}^n}$.

The protocol has $\epsilon$ error if the compressed and decompressed state is $\epsilon$-close in trace
distance to the original state $(|\varphi_\rho\rangle^{RA})^{\otimes n}$:

$$\left\| (\varphi_\rho^{RA})^{\otimes n} - (\mathcal{D}^{W \to \hat{A}^n} \circ \mathcal{E}^{A^n \to W})((\varphi_\rho^{RA})^{\otimes n}) \right\|_1 \leq \epsilon. \tag{17.5}$$



17.2 The Quantum Data Compression Theorem 

We say that a quantum compression rate $R$ is achievable if there exists an $(n, R + \delta, \epsilon)$
quantum compression code for all $\delta, \epsilon > 0$ and all sufficiently large $n$. Schumacher's compression
theorem establishes the von Neumann entropy as the fundamental limit on quantum data
compression.

Figure 17.2: Schumacher's compression protocol. Alice begins with many copies of the output of the
quantum information source. She performs a measurement onto the typical subspace corresponding to the
state $\rho$ and then performs a compression isometry of the typical subspace to a space of dimension $2^{n[H(\rho)+\delta]}$, that is, $n[H(\rho) + \delta]$ qubits.
She transmits these compressed qubits over $n[H(\rho) + \delta]$ uses of a noiseless quantum channel. Bob performs
the inverse of the isometry to uncompress the qubits. The protocol is successful in the asymptotic limit due
to the properties of typical subspaces.

Theorem 17.2.1 (Quantum Data Compression). Suppose that $\rho^A$ is the density operator
of the quantum information source. Then the von Neumann entropy $H(A)_\rho$ is the smallest
achievable rate $R$ for quantum data compression:

$$\inf\{R : R \text{ is achievable}\} = H(A)_\rho. \tag{17.6}$$



17.2.1 The Direct Coding Theorem 

Schumacher's compression protocol demonstrates that the von Neumann entropy $H(A)_\rho$ is
an achievable rate for quantum data compression. It is remarkably similar to Shannon's
compression protocol from Section 13.4, but it has some subtle differences that are necessary
for the quantum setting. The basic steps of the encoding are to perform a typical subspace
measurement and an isometry that compresses the typical subspace. The decoder then
performs the inverse of the isometry to decompress the state. The protocol is successful
if the typical subspace measurement successfully projects onto the typical subspace, and it
fails otherwise. Just like in the classical case, the law of large numbers guarantees that the
protocol is successful in the asymptotic limit as $n \to \infty$. Figure 17.2 provides an illustration
of the protocol, and we now provide a rigorous argument.

Alice begins with many copies of the state $(\varphi_\rho^{RA})^{\otimes n}$. Suppose that the spectral
decomposition of $\rho$ is as follows:

$$\rho = \sum_z p_Z(z)|z\rangle\langle z|, \tag{17.7}$$
where $p_Z(z)$ is some probability distribution, and $\{|z\rangle\}$ is some orthonormal basis. Her first
step $\mathcal{E}^{A^n \to Y A^n}$ is to perform a typical subspace measurement of the form in (14.1.4) onto the
typical subspace of $A^n$, where the typical projector is with respect to the density operator
$\rho$. The action of $\mathcal{E}^{A^n \to Y A^n}$ on a general state $\sigma^{A^n}$ is

$$\mathcal{E}^{A^n \to Y A^n}(\sigma^{A^n}) \equiv |0\rangle\langle 0|^{Y} \otimes (I - \Pi_\delta^{A^n})\sigma^{A^n}(I - \Pi_\delta^{A^n}) + |1\rangle\langle 1|^{Y} \otimes \Pi_\delta^{A^n}\sigma^{A^n}\Pi_\delta^{A^n}, \tag{17.8}$$

and the classically-correlated flag bit $Y$ indicates whether the typical subspace projection $\Pi_\delta^{A^n}$
is successful or unsuccessful. Recall from the Shannon compression protocol in Section 13.4
that we exploited a function $f$ that mapped from the set of typical sequences to a set of
binary sequences $\{0, 1\}^{n[H(\rho)+\delta]}$. Now, we can construct an isometry $U_f$ that is a coherent
version of this classical function $f$. It simply maps the orthonormal basis $\{|z^n\rangle^{A^n}\}$ to the
basis $\{|f(z^n)\rangle^{W}\}$:

$$U_f \equiv \sum_{z^n \in T_\delta^{Z^n}} |f(z^n)\rangle^{W}\langle z^n|^{A^n}, \tag{17.9}$$

where $Z$ is a random variable corresponding to the distribution $p_Z(z)$ so that $T_\delta^{Z^n}$ is its
typical set. The above operator is an isometry because the input space is a subspace of size
at most $2^{n[H(\rho)+\delta]}$ (recall Property 14.1.2) embedded in a larger space of size $2^n$ (at least for
qubits) and the output space is of size at most $2^{n[H(\rho)+\delta]}$. So her next step $\mathcal{E}^{Y A^n \to Y W}$ is to
perform the isometric compression conditional on the flag bit $Y$ being equal to one, and the
action of $\mathcal{E}^{Y A^n \to Y W}$ on a general classical-quantum state $\sigma^{Y A^n} \equiv |0\rangle\langle 0|^{Y} \otimes \sigma_0^{A^n} + |1\rangle\langle 1|^{Y} \otimes \sigma_1^{A^n}$
is as follows:

$$\mathcal{E}^{Y A^n \to Y W}(\sigma^{Y A^n}) \equiv |0\rangle\langle 0|^{Y} \otimes \mathrm{Tr}\{\sigma_0^{A^n}\}|e\rangle\langle e|^{W} + |1\rangle\langle 1|^{Y} \otimes U_f \sigma_1^{A^n} U_f^\dagger, \tag{17.10}$$

where $|e\rangle^{W}$ is some error flag orthogonal to all of the states $\{|f(z^n)\rangle^{W}\}_{z^n \in T_\delta^{Z^n}}$. This last
step completes the details of her encoder $\mathcal{E}^{A^n \to Y W}$, and the action of it on the initial state is

$$\mathcal{E}^{A^n \to Y W}((\varphi_\rho^{RA})^{\otimes n}) \equiv (\mathcal{E}^{Y A^n \to Y W} \circ \mathcal{E}^{A^n \to Y A^n})((\varphi_\rho^{RA})^{\otimes n}). \tag{17.11}$$

Alice then transmits all of the compressed qubits over $n[H(\rho) + \delta] + 1$ uses of the noiseless
qubit channel.
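The classical function $f$ underlying the isometry $U_f$ can be sketched for a small block length. The example below assumes a hypothetical qubit source whose density operator has eigenvalues 0.9 and 0.1, uses weak typicality (sample entropy within $\delta$ of the entropy), and assigns binary labels by simple enumeration; this labeling is one possible choice of $f$, not necessarily the one intended in Section 13.4. Because $f$ is injective on the typical set, mapping $|z^n\rangle$ to $|f(z^n)\rangle$ preserves orthonormality, which is what makes $U_f$ an isometry.

```python
import itertools
import math

p_Z = {0: 0.9, 1: 0.1}   # eigenvalues of rho (hypothetical values)
n, delta = 10, 0.2       # block length and typicality parameter (assumptions)

H = -sum(q * math.log2(q) for q in p_Z.values())

def sample_entropy(seq):
    """Sample entropy -(1/n) log2 p(z^n) of a sequence of eigenbasis labels."""
    return -sum(math.log2(p_Z[z]) for z in seq) / n

# Weakly typical set: sequences whose sample entropy is delta-close to H.
typical = [seq for seq in itertools.product((0, 1), repeat=n)
           if abs(sample_entropy(seq) - H) <= delta]

# f labels each typical sequence with a bit string of length ceil(n(H + delta)).
num_bits = math.ceil(n * (H + delta))
f = {seq: format(index, "0{}b".format(num_bits))
     for index, seq in enumerate(typical)}

print("entropy H:", H)
print("typical sequences:", len(typical), "out of", 2 ** n)
print("qubits after compression:", num_bits, "instead of", n)
```

At such a short block length the typical set carries only a modest fraction of the probability, so this only illustrates the bookkeeping; the protocol's guarantees are asymptotic in $n$.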

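Bob's decoding $\mathcal{D}^{Y W \to A^n}$ performs the inverse of the isometry conditional on the flag
bit being equal to one and otherwise maps to some other state $|e\rangle$ outside of the typical
subspace. The action of the decoder on some general classical-quantum state $\sigma^{Y W} \equiv
|0\rangle\langle 0|^{Y} \otimes \sigma_0^{W} + |1\rangle\langle 1|^{Y} \otimes \sigma_1^{W}$ is

$$\mathcal{D}^{Y W \to Y A^n}(\sigma^{Y W}) \equiv |0\rangle\langle 0|^{Y} \otimes \mathrm{Tr}\{\sigma_0^{W}\}|e\rangle\langle e|^{A^n} + |1\rangle\langle 1|^{Y} \otimes U_f^\dagger \sigma_1^{W} U_f. \tag{17.12}$$

The final part of the decoder is to discard the classical flag bit: $\mathcal{D}^{Y A^n \to A^n} \equiv \mathrm{Tr}_Y\{\cdot\}$. Then

$$\mathcal{D}^{Y W \to A^n} \equiv \mathcal{D}^{Y A^n \to A^n} \circ \mathcal{D}^{Y W \to Y A^n}.$$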
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We now can analyze how this protocol performs with respect to our performance criterion 
in (17.5[). Consider the following chain of inequalities: 



(lO RA ) n — (T) YW ^ A " o £ An ^ YW \((m RA \ 

Tr y {|l)(lf ® {^ A f n } ~ (V YW - An o£ A "- YW )((cpf A ) 
|l)(l| y ® {<p* A f n - (vY w ^ YAn o£ An ^ YW )(( ( pf A ) 9n 



RA\® n s 
P 



\Y 



RA\® n 
P 

^RA\® n 



io(ir®«) 

|0}(0| y ®Tr{(/ 

+|i)(ir®nf«) 



nf 



)«) 0n }|e)(e 



l yiA n 
n <5 



(17.13) 
(17.14) 

(17.15) 



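The first equality follows by adding a flag bit $|1\rangle$ to $(\varphi^{RA})^{\otimes n}$ and tracing it out. The first inequality follows from monotonicity of trace distance under the discarding of subsystems (Corollary 9.1.2). The second equality follows by evaluating the map $\mathcal{D}_1^{YW \to YA^n} \circ \mathcal{E}^{A^n \to YW}$ on the state $(\varphi^{RA})^{\otimes n}$. Continuing, we have

\begin{align}
&\le \left\| |1\rangle\langle 1|^Y \otimes (\varphi^{RA})^{\otimes n} - |1\rangle\langle 1|^Y \otimes \Pi_\delta^{A^n} (\varphi^{RA})^{\otimes n} \Pi_\delta^{A^n} \right\|_1 + \left\| |0\rangle\langle 0|^Y \otimes \mathrm{Tr}\{ (I - \Pi_\delta^{A^n}) (\varphi^{RA})^{\otimes n} \}\,|e\rangle\langle e|^{A^n} \right\|_1 \tag{17.16} \\
&= \left\| (\varphi^{RA})^{\otimes n} - \Pi_\delta^{A^n} (\varphi^{RA})^{\otimes n} \Pi_\delta^{A^n} \right\|_1 + \mathrm{Tr}\{ (I - \Pi_\delta^{A^n}) (\varphi^{RA})^{\otimes n} \} \tag{17.17} \\
&\le 2\sqrt{\epsilon} + \epsilon. \tag{17.18}
\end{align}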


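The first inequality follows from the triangle inequality for trace distance (Lemma 9.1.2). The equality uses the facts $\|\rho \otimes \sigma - \omega \otimes \sigma\|_1 = \|\rho - \omega\|_1 \|\sigma\|_1 = \|\rho - \omega\|_1$ and $\|b\rho\|_1 = |b|\,\|\rho\|_1$ for some density operators $\rho$, $\sigma$, and $\omega$ and a constant $b$. The final inequality follows from the first property of typical subspaces:

$$\mathrm{Tr}\{ \Pi_\delta^{A^n} (\varphi^{RA})^{\otimes n} \} = \mathrm{Tr}\{ \Pi_\delta^{A^n} \rho^{\otimes n} \} \ge 1 - \epsilon, \tag{17.19}$$

and the Gentle Operator Lemma (Lemma 9.4.2).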
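The bound above can be made concrete with a small numerical sketch. The code below is only an illustration under stated assumptions: it takes a qubit source whose density operator has eigenvalues $(p, 1-p)$, uses the weak (entropy) typicality condition to decide which eigenvectors of $\rho^{\otimes n}$ are typical, and the function names and parameter values are hypothetical choices rather than anything fixed by the text.

```python
import math

def binary_entropy(p):
    """Binary entropy H2(p) in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def typical_subspace_stats(p, n, delta):
    """Return (probability, rate) of the delta-typical subspace of rho^{(x)n},
    where rho is a qubit density operator with eigenvalues (p, 1 - p).
    Eigenvectors are grouped by k, the number of times eigenvalue (1 - p) occurs."""
    h = binary_entropy(p)
    prob = 0.0
    dim = 0
    for k in range(n + 1):
        sample_entropy = -((n - k) * math.log2(p) + k * math.log2(1 - p)) / n
        if abs(sample_entropy - h) <= delta:
            # log of C(n, k) * p^(n-k) * (1-p)^k, computed in the log domain
            log_weight = (math.lgamma(n + 1) - math.lgamma(k + 1)
                          - math.lgamma(n - k + 1)
                          + (n - k) * math.log(p) + k * math.log(1 - p))
            prob += math.exp(log_weight)
            dim += math.comb(n, k)
    rate = math.log2(dim) / n if dim else 0.0
    return prob, rate

p = math.cos(math.pi / 8) ** 2          # assumed example eigenvalue
prob, rate = typical_subspace_stats(p, n=1000, delta=0.1)
eps = max(0.0, 1.0 - prob)              # plays the role of epsilon in (17.19)
error_bound = 2 * math.sqrt(eps) + eps  # right-hand side of (17.18)
```

Here `prob` corresponds to $\mathrm{Tr}\{\Pi_\delta^{A^n}\rho^{\otimes n}\}$, `rate` is the number of qubits per source symbol needed to hold the typical subspace, and `error_bound` evaluates $2\sqrt{\epsilon} + \epsilon$ for the resulting $\epsilon$.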



We remark that it is important for the typical subspace measurement in (17.8) to be implemented as a coherent quantum measurement. That is, the only information that this measurement should learn is whether the state is typical or not. Otherwise, there would be too much disturbance to the quantum information, and the protocol would fail at the desired task of compression. Such precise control on so many qubits is possible in principle, but it is of course rather daunting to implement in practice!



17.2.2 The Converse Theorem 

We now prove the converse theorem for quantum data compression by considering the most general compression protocol that meets the success criterion in (17.5) and demonstrating that such an asymptotically error-free protocol should have its rate of compression above the von Neumann entropy of the source. Alice would like to compress a state $\sigma$ that lives on a Hilbert space $A^n$. The purification $\phi^{R^n A^n}$ of this state lives on the joint systems $A^n$ and $R^n$, where $R^n$ is the purifying system (again, we should not confuse the reference system $R^n$ with the rate $R$). If she can compress any system on $A^n$ and recover it faithfully, then she should be able to do so for the purification of the state. An $(n, R + \delta, \epsilon)$ compression code has the property that it can compress at a rate $R$ with only error $\epsilon$. The quantum data processing is

$$A^n \xrightarrow{\ \mathcal{E}\ } W \xrightarrow{\ \mathcal{D}\ } \hat{A}^n, \tag{17.20}$$

and the following inequality holds for a successful quantum compression protocol:

$$\left\| \omega^{R^n \hat{A}^n} - \phi^{R^n A^n} \right\|_1 \le \epsilon, \tag{17.21}$$

where

$$\omega^{R^n \hat{A}^n} \equiv \mathcal{D}(\mathcal{E}(\phi^{R^n A^n})). \tag{17.22}$$

Consider the following chain of inequalities:

\begin{align}
2nR &= \log_2(2^{nR}) + \log_2(2^{nR}) \tag{17.23} \\
&\ge |H(W)_\omega| + |H(W|R^n)_\omega| \tag{17.24} \\
&\ge |H(W)_\omega - H(W|R^n)_\omega| \tag{17.25} \\
&= I(W; R^n)_\omega. \tag{17.26}
\end{align}



The first inequality follows from the fact that both the quantum entropy $H(W)$ and the conditional quantum entropy $H(W|R^n)$ cannot be larger than the logarithm of the dimension of the system $W$ (see Property 11.1.3 and Theorem 11.5.1). The second inequality is the triangle inequality. The second equality is from the definition of quantum mutual information. Continuing, we have

\begin{align}
&\ge I(\hat{A}^n; R^n)_\omega \tag{17.27} \\
&\ge I(\hat{A}^n; R^n)_\phi - n\epsilon' \tag{17.28} \\
&= I(A^n; R^n)_\phi - n\epsilon' \tag{17.29} \\
&= H(A^n)_\phi + H(R^n)_\phi - H(A^n R^n)_\phi - n\epsilon' \tag{17.30} \\
&= 2H(A^n)_\phi - n\epsilon'. \tag{17.31}
\end{align}



The first inequality follows from the quantum data processing inequality (Bob processes $W$ with the decoder to get $\hat{A}^n$). The second inequality follows from applying the Alicki-Fannes' inequality (see Exercise 11.9.7) to the success criterion in (17.21) and setting $\epsilon' \equiv 6\epsilon R + 4H_2(\epsilon)/n$. The first equality follows because the systems $\hat{A}^n$ and $A^n$ are isomorphic (they have the same dimension). The second equality is by the definition of quantum mutual information, and the last equality follows because the entropies of each half of a pure, bipartite state are equal and their joint entropy vanishes. In the case that the state $\phi$ is an IID state of the form $(|\varphi\rangle^{RA})^{\otimes n}$ from before, the von Neumann entropy is additive so that

$$H(A^n)_{\varphi^{\otimes n}} = nH(A)_{\varphi}, \tag{17.32}$$

and this shows that the rate $R$ must be greater than the entropy $H(A)$ if the error of the protocol vanishes in the asymptotic limit.
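For readers who want the last step spelled out, the following short derivation sketches how the two chains of inequalities combine with additivity of the entropy for an IID source. It only rearranges the bounds already stated in this section; no new assumption is introduced beyond the IID form of the source.

```latex
\begin{align*}
2nR &\ge I(W;R^n)_\omega \ge 2H(A^n)_\phi - n\epsilon'
      = 2nH(A)_\varphi - n\epsilon' \\
\Longrightarrow\quad
R &\ge H(A)_\varphi - \frac{\epsilon'}{2},
\qquad \epsilon' = 6\epsilon R + \frac{4H_2(\epsilon)}{n}
      \longrightarrow 0 \ \text{ as } \ \epsilon \to 0 .
\end{align*}
```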

17.3 Quantum Compression Example 

We now highlight a particular example where Schumacher compression gives a big savings in compression rates if noiseless quantum channels are available. Suppose that the ensemble is of the following form:

$$\left\{ \left(\tfrac{1}{2}, |0\rangle\right), \left(\tfrac{1}{2}, |+\rangle\right) \right\}. \tag{17.33}$$

This ensemble is known as the Bennett-92 ensemble because it is useful in Bennett's protocol for quantum key distribution. The naive strategy would be for Alice and Bob to exploit Shannon's compression protocol. That is, Alice would ignore the quantum nature of the states, and supposing that the classical label for them were available, she would encode the classical label. However, the entropy of the uniform distribution on two states is equal to one bit, so she would have to transmit classical messages at a rate of one bit per channel use. A far wiser strategy is to employ Schumacher compression. The density operator of the above ensemble is

$$\tfrac{1}{2}|0\rangle\langle 0| + \tfrac{1}{2}|+\rangle\langle +|, \tag{17.34}$$

which has the following spectral decomposition:

$$\cos^2(\pi/8)\,|+'\rangle\langle +'| + \sin^2(\pi/8)\,|-'\rangle\langle -'|, \tag{17.35}$$

where

$$|+'\rangle \equiv \cos(\pi/8)|0\rangle + \sin(\pi/8)|1\rangle, \tag{17.36}$$

$$|-'\rangle \equiv \sin(\pi/8)|0\rangle - \cos(\pi/8)|1\rangle. \tag{17.37}$$

The binary entropy $H_2(\cos^2(\pi/8))$ of the distribution $[\cos^2(\pi/8), \sin^2(\pi/8)]$ is approximately equal to

$$0.6009 \text{ qubits}, \tag{17.38}$$

and thus they can save a significant amount in terms of compression rate by employing 
Schumacher compression. This type of savings will always occur whenever the ensemble 
includes non-orthogonal quantum states. 
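The numbers in this example are easy to check directly. The sketch below, which uses only the Python standard library and hypothetical helper names, writes the density operator $\tfrac{1}{2}|0\rangle\langle 0| + \tfrac{1}{2}|+\rangle\langle +|$ as a real symmetric $2\times 2$ matrix in the computational basis, computes its eigenvalues in closed form, and evaluates the von Neumann entropy.

```python
import math

def eigenvalues_2x2_symmetric(a, b, d):
    """Eigenvalues of the real symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return mean + radius, mean - radius

def von_neumann_entropy(eigs):
    """Entropy in bits of a density operator with the given eigenvalues."""
    return -sum(x * math.log2(x) for x in eigs if x > 0)

# rho = 1/2 |0><0| + 1/2 |+><+| = [[3/4, 1/4], [1/4, 1/4]] in the {|0>, |1>} basis
eigs = eigenvalues_2x2_symmetric(0.75, 0.25, 0.25)
entropy = von_neumann_entropy(eigs)

print(eigs)     # close to (cos^2(pi/8), sin^2(pi/8))
print(entropy)  # close to 0.6009
```

The eigenvalues agree with the spectral decomposition in (17.35), and the entropy agrees with the value quoted in (17.38), compared with one bit per symbol for the naive classical strategy.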

Exercise 17.3.1 In the above example, suppose that Alice associates a classical label with the states, so that the ensemble instead is

$$\left\{ \left(\tfrac{1}{2}, |0\rangle\langle 0| \otimes |0\rangle\langle 0|\right), \left(\tfrac{1}{2}, |1\rangle\langle 1| \otimes |+\rangle\langle +|\right) \right\}. \tag{17.39}$$

Does this help in reducing the number of qubits she has to transmit to Bob?

17.4 Variations on the Schumacher Theme

We can propose several variations on the Schumacher compression theme. For example, suppose that the quantum information source corresponds to the following ensemble instead:

$$\{ p_X(x), \rho_x \}, \tag{17.40}$$

where each $\rho_x$ is a mixed state. Then the situation is not as "clear-cut" as in the simpler model for a quantum information source because the techniques exploited in the converse proof do not apply here. Thus, the entropy of the source does not serve as a lower bound on the ultimate compressibility rate.

Let us consider a special example of the above situation. Suppose that the mixed states $\rho_x$ live on orthogonal subspaces, and let $\rho^A = \sum_x p_X(x)\rho_x$ denote the expected density operator of the ensemble. These states are perfectly distinguishable by a measurement whose projectors project onto the different orthogonal subspaces. Alice could then perform this measurement and associate classical labels with each of the states:

$$\rho^{XA} \equiv \sum_x p_X(x)\,|x\rangle\langle x|^X \otimes \rho_x^A. \tag{17.41}$$

Furthermore, she can do this in principle without disturbing the state in any way, and therefore the entropy of the state $\rho^{XA}$ is equivalent to the original entropy of the state $\rho^A$:

$$H(A)_\rho = H(XA)_\rho. \tag{17.42}$$

Naively applying Schumacher compression to such a source is actually not a great strategy here. The compression rate would be equal to

$$H(XA)_\rho = H(X)_\rho + H(A|X)_\rho, \tag{17.43}$$

and in this case, $H(A|X)_\rho \ge 0$ because the conditioning system is classical (with strict inequality whenever some $\rho_x$ is not pure). For this case, a much better strategy than Schumacher compression is for Alice to measure the classical variable $X$, compress it with Shannon compression, and transmit it to Bob so that he can reconstruct the quantum states at his end of the channel. The rate of compression here is equal to the Shannon entropy $H(X)$, which is provably lower than $H(XA)$ for this example.
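A small numerical sketch illustrates the gap between the two strategies. The ensemble below is entirely hypothetical: two equiprobable mixed states on orthogonal subspaces, each specified only by its eigenvalues. Because the supports are orthogonal, the spectrum of the expected density operator is the collection of products $p_X(x)\lambda_{x,i}$, which is the only fact the code relies on.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical ensemble: labels x = 0, 1 with mixed states on orthogonal subspaces.
p_x = [0.5, 0.5]
spectra = [[0.9, 0.1], [0.6, 0.4]]   # eigenvalues of rho_0 and rho_1 (assumed values)

# Orthogonal supports: the eigenvalues of rho^A are p(x) * lambda_{x,i}.
joint_spectrum = [p * lam for p, eigs in zip(p_x, spectra) for lam in eigs]

h_xa = shannon_entropy(joint_spectrum)   # H(A) = H(XA): naive Schumacher rate
h_x = shannon_entropy(p_x)               # H(X): measure-and-Shannon-compress rate
h_a_given_x = sum(p * shannon_entropy(eigs) for p, eigs in zip(p_x, spectra))

print(h_xa, h_x, h_a_given_x)
```

The printed values satisfy $H(XA) = H(X) + H(A|X)$ as in (17.43), with $H(X)$ strictly smaller than $H(XA)$ because both assumed states are mixed.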

How should we handle the mixed source case in general? Let's consider the direct coding theorem and the converse theorem. The direct coding theorem for this case is essentially equivalent to Schumacher's protocol for quantum compression; there does not appear to be a better approach in the general case. The density operator of the source is equal to

$$\rho^A = \sum_x p_X(x)\,\rho_x^A. \tag{17.44}$$

A compression rate $R > H(A)_\rho$ is achievable if we form the typical subspace measurement from the typical subspace projector $\Pi_\delta^{A^n}$ onto the state $(\rho^A)^{\otimes n}$. Although the direct coding theorem stays the same, the converse theorem changes somewhat. A purification of the above density operator is as follows:

$$|\phi\rangle^{XX'RA} \equiv \sum_x \sqrt{p_X(x)}\,|x\rangle^X |x\rangle^{X'} |\phi_{\rho_x}\rangle^{RA}, \tag{17.45}$$

where each $|\phi_{\rho_x}\rangle^{RA}$ is a purification of $\rho_x^A$. So the purifying system is the joint system $XX'R$. Let $\omega^{XX'R\hat{A}}$ be the actual state generated by the protocol:

$$\omega^{XX'R\hat{A}} \equiv \mathcal{D}(\mathcal{E}(\phi^{XX'RA})). \tag{17.46}$$

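We can now provide an alternate converse proof:

\begin{align}
nR &= \log_2 d_W \tag{17.47} \\
&\ge H(W)_\omega \tag{17.48} \\
&\ge H(W)_\omega - H(W|X)_\omega \tag{17.49} \\
&= I(W; X)_\omega \tag{17.50} \\
&\ge I(\hat{A}; X)_\omega \tag{17.51} \\
&\ge I(A; X)_\phi - n\epsilon'. \tag{17.52}
\end{align}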

The first equality follows from evaluating the logarithm of the dimension of system $W$. The first inequality follows because the von Neumann entropy is at most the logarithm of the dimension of the system. The second inequality follows because $H(W|X)_\omega \ge 0$: the system $X$ is classical when tracing over $X'$ and $R$. The second equality follows from the definition of quantum mutual information. The third inequality follows from quantum data processing, and the final inequality follows from applying the Alicki-Fannes' inequality (similar to the way that we did for the converse of the quantum data compression theorem). Tracing over $X'$ and $R$ of $|\phi\rangle$ gives the following state

$$\sum_x p_X(x)\,|x\rangle\langle x|^X \otimes \rho_x^A, \tag{17.53}$$

demonstrating that the ultimate lower bound on the compression rate of a mixed source is the Holevo information of the ensemble. The next exercise asks you to verify that ensembles of mixed states on orthogonal subspaces saturate this bound.

Exercise 17.4.1 Show that the Holevo information of an ensemble of mixed states on orthogonal subspaces is equal to the Shannon entropy $H(X)$ of the ensemble's distribution. Thus, this is an example of a class of ensembles that meet the above lower bound on compressibility.



17.5 Concluding Remarks 



Schumacher compression was the first quantum Shannon-theoretic result discovered and is the simplest one that we encounter in this book. The proof is remarkably similar to the proof of Shannon's noiseless coding theorem, with the main difference that we should be more careful in the quantum case not to learn any more information than necessary when performing measurements. The intuition that we gain for future quantum protocols is that it often suffices to consider only what happens to a high probability subspace rather than the whole space itself if our primary goal is to have a small probability of error in a communication task. In fact, this intuition is the same needed for understanding information processing tasks such as entanglement concentration, classical communication, private classical communication, and quantum communication.

The problem of characterizing the lower and upper bounds for the quantum compression rate of a mixed state quantum information source still remains open, despite considerable efforts in this direction. It is only in special cases, such as the example mentioned in Section 17.4, that we know of a matching lower and upper bound as in Schumacher's original theorem.



17.6 History and Further Reading 



Ohya and Petz devised the notion of a typical subspace [200], and Schumacher independently introduced typical subspaces and proved the quantum data compression theorem in Ref. [216]. Jozsa and Schumacher later generalized this proof [168], and Lo further generalized the theorem to mixed state sources [186]. There are other generalizations in Refs. [147] [T3~]. Several schemes for universal quantum data compression exist [166], [167], 1ST], in which the sender does not need to have a description of the quantum information source in order to compress its output. There are also practical schemes for quantum data compression discussed in work about quantum Huffman codes [48].

CHAPTER 18



Entanglement Concentration 



Entanglement is one of the most useful resources in quantum information processing. If a sender and receiver share noiseless entanglement in the form of maximally entangled states, then Chapter 6 showed how they can teleport quantum bits between each other with the help of classical communication, or they can double the capacity of a noiseless qubit channel for transmitting classical information. We will see further applications in Chapter 20 where they can exploit noiseless entanglement to assist in the transmission of classical or quantum data over a noisy quantum channel.

Given the utility of maximal entanglement, a reasonable question is to ask what a sender 
and receiver can accomplish if they share pure entangled states that are not maximally 
entangled. In the quantum Shannon-theoretic setting, we make the further assumption that 
the sender and receiver can share many copies of these pure entangled states. We find 
out in this chapter that they can "concentrate" these non-maximally entangled states to 
maximally entangled ebits, and the optimal rate at which they can do so in the asymptotic 
limit is equal to the "entropy of entanglement" (the von Neumann entropy of half of one 
copy of the original state). Entanglement concentration is thus another fundamental task in 
noiseless quantum Shannon theory, and it gives a different operational interpretation to the 
von Neumann entropy. 

Entanglement concentration is perhaps complementary to Schumacher compression in the 
sense that it gives a firm quantum information theoretic interpretation of the term "ebit" 
(just as Schumacher compression did so for the term "qubit"), and it demonstrates how the 
entropy of entanglement is the unique measure of entanglement for pure bipartite states. 
Despite the similarity to Schumacher compression in this respect, entanglement concentra- 
tion is a fundamentally different protocol, and we will see that these two protocols are not 
interchangeable. That is, exploiting the Schumacher compression protocol for the task of 
entanglement concentration fails at accomplishing the goal of entanglement concentration, 
and vice versa. 

The technique for proving that the von Neumann entropy is an achievable rate for entanglement concentration exploits the method of types outlined in Sections 13.7 and 14.3 for classical and quantum typicality, respectively (the most important property is Property 13.7.5, which states that the exponential of the entropy is a lower bound on the size of a typical type class). In hindsight, it is perhaps surprising that a typical type class is exponentially large in the large n limit (on the same order as the typical set itself), and we soon discover the quantum Shannon-theoretic consequences of this result.

We begin this chapter by discussing a simple example of entanglement concentration for a finite number of copies of a state. Section 18.2 then details the information processing task that entanglement concentration attempts to accomplish, and Section 18.3 proves both the direct coding theorem and the converse theorem for entanglement concentration. We then discuss how common randomness concentration is the closest classical analog of the entanglement concentration protocol. Finally, we discuss the differences between Schumacher compression and entanglement concentration, especially how exploiting one protocol to accomplish the other's information processing task results in a failure of the intended goal.

18.1 An Example of Entanglement Concentration 

A simple example illustrates the main idea underlying the concentration of entanglement. Consider the following partially entangled state:

$$|\Phi_\theta\rangle^{AB} \equiv \cos(\theta)|00\rangle^{AB} + \sin(\theta)|11\rangle^{AB}, \tag{18.1}$$

where $\theta$ is some parameter such that $0 < \theta < \pi/2$. The Schmidt decomposition (Theorem 3.6.1) guarantees that the above state is the most general form for a pure bipartite entangled state on qubits. Now suppose that Alice and Bob share three copies of the above state. We can rewrite the three copies of the above state with some straightforward algebra:

\begin{align}
& |\Phi_\theta\rangle^{A_1B_1} |\Phi_\theta\rangle^{A_2B_2} |\Phi_\theta\rangle^{A_3B_3} \notag \\
&= \cos^3(\theta)|000\rangle^A|000\rangle^B + \sin^3(\theta)|111\rangle^A|111\rangle^B \notag \\
&\quad + \cos(\theta)\sin^2(\theta)\left( |110\rangle^A|110\rangle^B + |101\rangle^A|101\rangle^B + |011\rangle^A|011\rangle^B \right) \notag \\
&\quad + \cos^2(\theta)\sin(\theta)\left( |100\rangle^A|100\rangle^B + |010\rangle^A|010\rangle^B + |001\rangle^A|001\rangle^B \right) \tag{18.2} \\
&= \cos^3(\theta)|000\rangle^A|000\rangle^B + \sin^3(\theta)|111\rangle^A|111\rangle^B \notag \\
&\quad + \sqrt{3}\cos(\theta)\sin^2(\theta)\,\frac{1}{\sqrt{3}}\left( |110\rangle^A|110\rangle^B + |101\rangle^A|101\rangle^B + |011\rangle^A|011\rangle^B \right) \notag \\
&\quad + \sqrt{3}\cos^2(\theta)\sin(\theta)\,\frac{1}{\sqrt{3}}\left( |100\rangle^A|100\rangle^B + |010\rangle^A|010\rangle^B + |001\rangle^A|001\rangle^B \right), \tag{18.3}
\end{align}

where we relabel all of the systems on Alice and Bob's respective sides as $A \equiv A_1A_2A_3$ and $B \equiv B_1B_2B_3$. Observe that the subspace with coefficient $\cos^3(\theta)$ whose states have zero "ones" is one-dimensional. The subspace whose states have three "ones" is also one-dimensional. But the subspace with coefficient $\cos(\theta)\sin^2(\theta)$ whose states have two "ones" is three-dimensional, and the same holds for the subspace whose states each have one "one."

©2012 Mark M. Wilde — This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 
Unported License 






A protocol for entanglement concentration in this scenario is then straightforward. Alice performs a projective measurement consisting of the operators $\Pi_0$, $\Pi_1$, $\Pi_2$, $\Pi_3$, where

\begin{align}
\Pi_0 &\equiv |000\rangle\langle 000|^A, \tag{18.4} \\
\Pi_1 &\equiv |001\rangle\langle 001|^A + |010\rangle\langle 010|^A + |100\rangle\langle 100|^A, \tag{18.5} \\
\Pi_2 &\equiv |110\rangle\langle 110|^A + |101\rangle\langle 101|^A + |011\rangle\langle 011|^A, \tag{18.6} \\
\Pi_3 &\equiv |111\rangle\langle 111|^A. \tag{18.7}
\end{align}

The subscript $i$ of the projection operator $\Pi_i$ corresponds to the Hamming weight of the basis states in the corresponding subspace. Bob can perform the same "Hamming weight" measurement on his side. With probability $\cos^6(\theta) + \sin^6(\theta)$, the procedure fails because it results in $|000\rangle^A|000\rangle^B$ or $|111\rangle^A|111\rangle^B$, which is not a maximally entangled state. But with probability $3\cos^2(\theta)\sin^4(\theta)$, the state is in the subspace with Hamming weight two, and it has the following form:

$$\frac{1}{\sqrt{3}}\left( |110\rangle^A|110\rangle^B + |101\rangle^A|101\rangle^B + |011\rangle^A|011\rangle^B \right), \tag{18.8}$$

and with probability $3\cos^4(\theta)\sin^2(\theta)$, the state is in the subspace with Hamming weight one, and it has the following form:

$$\frac{1}{\sqrt{3}}\left( |100\rangle^A|100\rangle^B + |010\rangle^A|010\rangle^B + |001\rangle^A|001\rangle^B \right). \tag{18.9}$$

Alice and Bob can then perform local operations on their respective systems to rotate either of these states to a maximally entangled state with Schmidt rank three:

$$\frac{1}{\sqrt{3}}\left( |0\rangle^A|0\rangle^B + |1\rangle^A|1\rangle^B + |2\rangle^A|2\rangle^B \right). \tag{18.10}$$

The simple protocol outlined above is the basis for the entanglement concentration protocol, but it unfortunately fails with a non-negligible probability in this case. On the other hand, if we allow Alice and Bob to have a potentially infinite number of copies of a pure bipartite entangled state, the probability of failing becomes negligible in the asymptotic limit due to the properties of typicality, and each type class subspace contains an exponentially large maximally entangled state. The proof of the direct coding theorem in Section 18.3.1 makes this intuition precise.
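The outcome probabilities of the three-copy protocol can be tabulated with a few lines of code. This is a minimal sketch with hypothetical names; the only inputs are the probabilities stated above for each Hamming-weight outcome, and the choice $\theta = \pi/4$ is an arbitrary example value.

```python
import math

def outcome_probabilities(theta):
    """Probability of each Hamming-weight outcome for three copies of
    cos(theta)|00> + sin(theta)|11>."""
    c2 = math.cos(theta) ** 2
    s2 = math.sin(theta) ** 2
    return {
        0: c2 ** 3,           # |000>|000>: failure
        1: 3 * c2 ** 2 * s2,  # Schmidt rank three
        2: 3 * c2 * s2 ** 2,  # Schmidt rank three
        3: s2 ** 3,           # |111>|111>: failure
    }

probs = outcome_probabilities(math.pi / 4)
p_fail = probs[0] + probs[3]
p_success = probs[1] + probs[2]
# Each successful outcome yields a rank-three maximally entangled state,
# which is worth log2(3) ebits.
expected_ebits = p_success * math.log2(3)
print(p_fail, p_success, expected_ebits)
```

Even in the most favorable case $\theta = \pi/4$, the failure probability for three copies is one quarter, which is the non-negligible failure probability mentioned above.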

Generalizing the procedure outlined above to an arbitrary number of copies is straightforward. Suppose Alice and Bob share $n$ copies of the partially entangled state $|\Phi_\theta\rangle$. We can then write the state as follows:

$$|\Phi_\theta\rangle^{A^nB^n} = \sum_{k=0}^{n} \sqrt{\binom{n}{k}}\cos^{n-k}(\theta)\sin^{k}(\theta) \left( \frac{1}{\sqrt{\binom{n}{k}}} \sum_{x\,:\,w(x)=k} |x\rangle^{A^n}|x\rangle^{B^n} \right), \tag{18.11}$$

where $w(x)$ is the Hamming weight of the binary vector $x$. Alice performs a "Hamming weight" measurement whose projective operators are as follows:

$$\Pi_k = \sum_{x\,:\,w(x)=k} |x\rangle\langle x|^{A^n}, \tag{18.12}$$

and the Schmidt rank of the maximally entangled state that they then share is $\binom{n}{k}$.

We can give a rough analysis of the performance of the above protocol when $n$ becomes large by exploiting Stirling's approximation (we just need a handle on the term $\binom{n}{k}$ for large $n$). Recall that Stirling's approximation is $n! \approx \sqrt{2\pi n}\,(n/e)^n$, and this gives

\begin{align}
\binom{n}{k} &= \frac{n!}{k!\,(n-k)!} \tag{18.13} \\
&\approx \frac{\sqrt{2\pi n}\,(n/e)^n}{\sqrt{2\pi k}\,(k/e)^k \sqrt{2\pi (n-k)}\,((n-k)/e)^{n-k}} \tag{18.14} \\
&= \sqrt{\frac{n}{2\pi k(n-k)}}\; \frac{n^n}{(n-k)^{n-k} k^k} \tag{18.15} \\
&\equiv \mathrm{poly}(n) \left( \frac{n-k}{n} \right)^{-(n-k)} \left( \frac{k}{n} \right)^{-k} \tag{18.16} \\
&= \mathrm{poly}(n)\, 2^{n\left[ -((n-k)/n)\log((n-k)/n) - (k/n)\log(k/n) \right]} \tag{18.17} \\
&= \mathrm{poly}(n)\, 2^{nH_2(k/n)}, \tag{18.18}
\end{align}

where $H_2$ is the binary entropy function in (1.1) and $\mathrm{poly}(n)$ indicates a term at most polynomial in $n$. When $n$ is large, the exponential term $2^{nH_2(k/n)}$ dominates the polynomial $\sqrt{n/2\pi k(n-k)}$, so that the polynomial term begins to behave merely as a constant. So, the protocol is for Alice to perform a typical subspace measurement with respect to the distribution $(\cos^2(\theta), \sin^2(\theta))$, and the state then collapses to the following one with high probability:

$$\frac{1}{\sqrt{M}} \sum_{\substack{k\,:\,|k/n - \sin^2(\theta)| \le \delta, \\ |(n-k)/n - \cos^2(\theta)| \le \delta}} \sqrt{\binom{n}{k}}\cos^{n-k}(\theta)\sin^{k}(\theta) \left( \frac{1}{\sqrt{\binom{n}{k}}} \sum_{x\,:\,w(x)=k} |x\rangle^{A^n}|x\rangle^{B^n} \right), \tag{18.19}$$

where $M$ is an appropriate normalization constant. Alice and Bob then both perform a Hamming weight measurement and the state collapses to a state of the form:

$$\frac{1}{\sqrt{\mathrm{poly}(n)\, 2^{nH_2(k/n)}}} \sum_{x\,:\,w(x)=k} |x\rangle^{A^n}|x\rangle^{B^n}, \tag{18.20}$$

depending on the outcome $k$ of the measurement. The above state is a maximally entangled state with Schmidt rank $\mathrm{poly}(n)\, 2^{nH_2(k/n)}$, and it follows that

$$H_2(k/n) \ge H_2(\cos^2(\theta)) - \delta, \tag{18.21}$$

from the assumption that the state first projects into the typical subspace. Alice and Bob can then perform local operations to rotate this state to approximately $nH_2(\cos^2(\theta))$ ebits. Thus, this procedure concentrates the original non-maximally entangled state to ebits at a rate equal to the entropy of entanglement of the state $|\Phi_\theta\rangle$ in (18.1). The above proof is a bit rough, and it applies only to entangled qubit systems in a pure state. The direct coding theorem in Section 18.3.1 generalizes this proof to pure entangled states on d-dimensional systems.
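The rough analysis of this section can be checked numerically. The sketch below is an illustration with hypothetical function names: it compares $\frac{1}{n}\log_2\binom{n}{k}$ with $H_2(k/n)$, and it averages the ebit yield $\log_2\binom{n}{k}$ of the Hamming weight measurement over the outcome distribution $\binom{n}{k}\cos^{2(n-k)}(\theta)\sin^{2k}(\theta)$. The values of $n$, $k$, and $\theta$ are arbitrary example choices.

```python
import math

def binary_entropy(p):
    """Binary entropy H2(p) in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def log2_binomial(n, k):
    """log2 of the binomial coefficient C(n, k), via log-gamma."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(2)

def expected_ebit_rate(theta, n):
    """Average of log2 C(n, k) / n over the Hamming-weight outcome k."""
    log_c2 = 2 * math.log(math.cos(theta))
    log_s2 = 2 * math.log(math.sin(theta))
    total = 0.0
    for k in range(n + 1):
        log2_rank = log2_binomial(n, k)
        log_prob = log2_rank * math.log(2) + (n - k) * log_c2 + k * log_s2
        total += math.exp(log_prob) * log2_rank
    return total / n

# Stirling-type estimate: (1/n) log2 C(n, k) is close to H2(k/n).
stirling_gap = abs(log2_binomial(1000, 300) / 1000 - binary_entropy(0.3))

theta = math.pi / 8
entropy_of_entanglement = binary_entropy(math.sin(theta) ** 2)
rate = expected_ebit_rate(theta, n=2000)
print(stirling_gap, entropy_of_entanglement, rate)
```

The average yield per copy falls slightly below the entropy of entanglement. The shortfall is exactly the entropy of the measurement outcome $k$ divided by $n$, since conditioned on $k$ the Schmidt coefficients are uniform, and it vanishes as $n$ grows.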



18.2 The Information Processing Task 

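We first detail the information processing task that entanglement concentration sets out to accomplish. An $(n, E - \delta, \epsilon)$ entanglement concentration protocol consists of just one step of processing. Alice and Bob begin with many copies $(|\phi\rangle^{AB})^{\otimes n}$ of a pure bipartite, entangled state $|\phi\rangle^{AB}$. Alice and Bob each then perform local CPTP maps $\mathcal{E}^{A^n \to \hat{A}}$ and $\mathcal{F}^{B^n \to \hat{B}}$ in an attempt to concentrate the original state $(|\phi\rangle^{AB})^{\otimes n}$ to a maximally entangled state:

$$\omega^{\hat{A}\hat{B}} \equiv (\mathcal{E}^{A^n \to \hat{A}} \otimes \mathcal{F}^{B^n \to \hat{B}})((\phi^{AB})^{\otimes n}). \tag{18.22}$$

The protocol has $\epsilon$ error if the final state $\omega^{\hat{A}\hat{B}}$ is $\epsilon$-close to a maximally entangled state $|\Phi\rangle^{\hat{A}\hat{B}}$:

$$\left\| \omega^{\hat{A}\hat{B}} - \Phi^{\hat{A}\hat{B}} \right\|_1 \le \epsilon, \tag{18.23}$$

where

$$|\Phi\rangle^{\hat{A}\hat{B}} \equiv \frac{1}{\sqrt{D}} \sum_{i=0}^{D-1} |i\rangle^{\hat{A}} |i\rangle^{\hat{B}}, \tag{18.24}$$

and the rate $E$ of ebit extraction is

$$E = \frac{1}{n}\log_2(D) + \delta, \tag{18.25}$$

where $\delta$ is some arbitrarily small positive number. We say that a particular rate $E$ of entanglement concentration is achievable if there exists an $(n, E - \delta, \epsilon)$ entanglement concentration protocol for all $\epsilon, \delta > 0$ and sufficiently large $n$. Figure 18.1 displays the operation of a general entanglement concentration protocol.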

18.3 The Entanglement Concentration Theorem 

We first state the entanglement concentration theorem and then prove it below in two parts 
(the direct coding theorem and the converse theorem). 

Theorem 18.3.1 (Entanglement Concentration). Suppose that $|\phi\rangle^{AB}$ is a pure bipartite state that Alice and Bob would like to concentrate. Then the von Neumann entropy $H(A)_\phi$ is the highest achievable rate $E$ for entanglement concentration:

$$\sup\{E : E \text{ is achievable}\} = H(A)_\phi. \tag{18.26}$$

Figure 18.1: The most general protocol for entanglement concentration. Alice and Bob begin with many copies of some pure bipartite state $|\phi\rangle^{AB}$. They then perform local operations to concentrate this state to a maximally entangled state.



18.3.1 The Direct Coding Theorem 

The proof of the direct coding theorem demonstrates the inequality LHS $\ge$ RHS in Theorem 18.3.1. We first outline the technique and then provide a detailed proof. Suppose that Alice and Bob share many copies of a pure, bipartite entangled state $|\phi\rangle^{AB}$ that has a Schmidt decomposition of the following form:

$$|\phi\rangle^{AB} = \sum_{x \in \mathcal{X}} \sqrt{p_X(x)}\,|x\rangle^A |x\rangle^B. \tag{18.27}$$



The fact that the states of Alice and Bob in the above superposition are coordinated via the distribution $p_X(x)$ is what leads to the possibility of performing entanglement concentration (we are assuming that the state above has a Schmidt rank greater than one so that there is some entanglement in it; the protocol given here does not extract any entanglement otherwise). Many copies of the above state admit a type decomposition into different type class subspaces, where each state in a given type class subspace is maximally entangled (just as we observed in the example from Section 18.1). Alice and Bob both first perform a typical subspace measurement onto their $n$ local systems. The fact that their states are coordinated via the distribution $p_X(x)$ implies that they receive the same outcome: the result of the measurement is either typical or atypical for both of them, and the successful typical outcome occurs with high probability in the asymptotic limit. They then perform a type measurement, which in the case of qubits corresponds to a "Hamming weight" measurement. Both of them receive the same outcome for this measurement, and their resulting state is now maximally entangled in the same type class subspace. Furthermore, we know that the size of this type class subspace is larger than $\approx 2^{nH(A)}$ in the large $n$ limit because a typical type class has the aforementioned size (Property 13.7.5). They then both perform another projective measurement to bin this subspace into smaller ones in order to guarantee that their maximally entangled state lives in a subspace whose dimension is a power of two (so that they can get ebits) and has the same size for all possible types. Finally, conditional on the type and the bin, they each perform an isometry that rotates their state to $\approx nH(A)$ ebits. Failure of the above protocol can only occur at two points in this protocol: if the outcome of the typical subspace measurement fails or if the second projection onto a slightly smaller subspace fails. Both failures occur with negligible probability in the asymptotic limit.

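We now provide a rigorous proof that the above protocol works as claimed. Consider taking many copies of the state $|\phi\rangle^{AB}$:

$$(|\phi\rangle^{AB})^{\otimes n} = \sum_{x^n \in \mathcal{X}^n} \sqrt{p_{X^n}(x^n)}\,|x^n\rangle^{A^n} |x^n\rangle^{B^n}. \tag{18.28}$$

We can write the above state in terms of its type decomposition:

\begin{align}
|\phi\rangle^{A^nB^n} &= \sum_t \sum_{x^n \in T_t} \sqrt{p_{X^n}(x^n)}\,|x^n\rangle^{A^n} |x^n\rangle^{B^n} \tag{18.29} \\
&= \sum_t \sqrt{p_{X^n}(x_t^n)} \sum_{x^n \in T_t} |x^n\rangle^{A^n} |x^n\rangle^{B^n} \tag{18.30} \\
&= \sum_t \sqrt{p_{X^n}(x_t^n)\, d_t}\; \frac{1}{\sqrt{d_t}} \sum_{x^n \in T_t} |x^n\rangle^{A^n} |x^n\rangle^{B^n} \tag{18.31} \\
&= \sum_t \sqrt{p(t)}\,|\Phi_t\rangle^{A^nB^n}. \tag{18.32}
\end{align}

The first equality follows by decomposing the state into its different type class subspaces. The next equality follows because $p_{X^n}(x^n)$ is the same for all sequences $x^n$ in the same type class and because the distribution is IID (let $x_t^n$ be some representative sequence of all sequences in the type class $T_t$). The third equality follows by introducing $d_t$ as the dimension of a type class subspace $T_t$, and the final equality follows from the definitions

\begin{align}
p(t) &\equiv p_{X^n}(x_t^n)\, d_t, \tag{18.33} \\
|\Phi_t\rangle^{A^nB^n} &\equiv \frac{1}{\sqrt{d_t}} \sum_{x^n \in T_t} |x^n\rangle^{A^n} |x^n\rangle^{B^n}. \tag{18.34}
\end{align}

Observe that the state $|\Phi_t\rangle^{A^nB^n}$ is maximally entangled.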
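The type decomposition can be explored concretely for small $n$. The following sketch, with hypothetical function names and an assumed three-outcome Schmidt distribution, enumerates all types, computes the type class dimension $d_t$ as a multinomial coefficient, and forms $p(t) = p_{X^n}(x_t^n)\,d_t$ as in (18.33).

```python
import math

def types(n, d):
    """Yield all types (count vectors) of length-n sequences over d symbols."""
    if d == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in types(n - first, d - 1):
            yield (first,) + rest

def type_class_size(counts):
    """Number d_t of sequences with the given symbol counts (multinomial)."""
    size = math.factorial(sum(counts))
    for c in counts:
        size //= math.factorial(c)
    return size

def type_distribution(p, n):
    """Map each type t to p(t) = p_{X^n}(x_t^n) * d_t for an IID source p."""
    dist = {}
    for t in types(n, len(p)):
        seq_prob = math.prod(px ** c for px, c in zip(p, t))
        dist[t] = seq_prob * type_class_size(t)
    return dist

# Assumed example: squared Schmidt coefficients (0.5, 0.3, 0.2) and n = 6 copies.
dist = type_distribution([0.5, 0.3, 0.2], 6)
# log2(d_t) is the number of ebits available after a type measurement yields t.
ebits_by_type = {t: math.log2(type_class_size(t)) for t in dist}
print(len(dist), sum(dist.values()), ebits_by_type[(2, 2, 2)])
```

The values $p(t)$ sum to one, as they must for the squared amplitudes in (18.32), and each outcome $t$ of the type measurement leaves a maximally entangled state of Schmidt rank $d_t$.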

Alice's first action $\mathcal{E}_1^{A^n \to YA^n}$ is to perform a typical subspace measurement of the form in (14.1.4) onto the typical subspace of $A^n$, where the typical projector is with respect to the density operator $\rho = \sum_x p_X(x)|x\rangle\langle x|$ (this first step is the same as in Schumacher compression). The action of $\mathcal{E}_1^{A^n \to YA^n}$ on a general state $\sigma^{A^n}$ is

$$\mathcal{E}_1^{A^n \to YA^n}(\sigma^{A^n}) \equiv |0\rangle\langle 0|^Y \otimes (I - \Pi_\delta^{A^n})\,\sigma^{A^n} (I - \Pi_\delta^{A^n}) + |1\rangle\langle 1|^Y \otimes \Pi_\delta^{A^n} \sigma^{A^n} \Pi_\delta^{A^n}, \tag{18.35}$$

and the classically-correlated flag bit $Y$ indicates whether the typical subspace projection $\Pi_\delta^{A^n}$ is successful or unsuccessful. Conditional on the typical subspace measurement being successful (the flag bit being equal to one), she next performs a type class subspace measurement $\mathcal{E}_2^{YA^n \to YTA^n}$ that places the type in a classical register $T$. Its action on a general classical-quantum state $\sigma^{YA^n} \equiv |0\rangle\langle 0|^Y \otimes \sigma_0^{A^n} + |1\rangle\langle 1|^Y \otimes \sigma_1^{A^n}$ is as follows:

$$\mathcal{E}_2^{YA^n \to YTA^n}(\sigma^{YA^n}) \equiv |0\rangle\langle 0|^Y \otimes |e\rangle\langle e|^T \otimes \sigma_0^{A^n} + \sum_t |1\rangle\langle 1|^Y \otimes |t\rangle\langle t|^T \otimes \Pi_t\, \sigma_1^{A^n} \Pi_t, \tag{18.36}$$

where $\{\Pi_t\}_t$ are the elements of the type class subspace measurement (recall from (14.118) that the typical projector decomposes into a sum of the type class projectors) and $|e\rangle$ is some state that is orthogonal to all of the types $|t\rangle$. Each type class projector $\Pi_t$ projects onto a subspace of size at least

$$2^{nH(A)_\phi - n\eta(d\delta) - d\log(n+1)}, \tag{18.37}$$

where $\delta$ is the typicality parameter, $d$ is the dimension of system $A$, and $\eta(x)$ is a function such that $\lim_{x \to 0} \eta(x) = 0$. This lower bound follows by exploiting Property 14.3.2. The key observation to make at this point is that Alice and Bob's shared state at this point is a maximally entangled state of the following form:

$$\frac{1}{\sqrt{|T_t|}} \sum_{x^n \in T_t} |x^n\rangle^{A^n} |x^n\rangle^{B^n}. \tag{18.38}$$

This result follows because the distribution $p_{X^n}(x^n)$ becomes uniform when conditioned on a particular type (see the discussion in Section 13.7). Let $d_t$ be the Schmidt rank of the entangled state above.

They now just need to "chop" the above maximally entangled state down to a state of $m$ ebits. Conditional on the flag bit being equal to one and the $T$ register not being equal to $|e\rangle$, Alice and Bob agree beforehand on a partition of their spaces into one of size $(1 - \epsilon')d_t$ and another of size $\epsilon' d_t$ where $\epsilon' > 0$. They then further partition the larger space of size $(1 - \epsilon')d_t$ into $\Delta$ bins, each of size $2^m$ where

$$m = \left\lfloor nH(A) - n\eta(d\delta) - d\log(n+1) + \log(1 - \epsilon') \right\rfloor, \tag{18.39}$$

so that $(1 - \epsilon')d_t = \Delta\, 2^m$ (they can make the other register of size $\epsilon' d_t$ smaller and $\Delta$ larger if need be so that $\Delta$ is an integer). Conditional on the flag bit being equal to one and the $T$ register not being equal to $|e\rangle$, Alice then performs a projective measurement onto this partitioned type class. The action of this measurement $\mathcal{E}_3^{YTA^n \to YSTA^n}$ on a classical-quantum state of the form

$$\sigma^{YTA^n} \equiv |0\rangle\langle 0|^Y \otimes |e\rangle\langle e|^T \otimes \sigma_0^{A^n} + \sum_t |1\rangle\langle 1|^Y \otimes |t\rangle\langle t|^T \otimes \sigma_t^{A^n} \tag{18.40}$$

is as follows:

$$\mathcal{E}_3^{YTA^n \to YSTA^n}(\sigma^{YTA^n}) \equiv |0\rangle\langle 0|^Y \otimes |0\rangle\langle 0|^S \otimes |e\rangle\langle e|^T \otimes \sigma_0^{A^n} + \sum_{t,\,s} |1\rangle\langle 1|^Y \otimes |s\rangle\langle s|^S \otimes |t\rangle\langle t|^T \otimes \Pi_{s,t}\, \sigma_t^{A^n} \Pi_{s,t}, \tag{18.41}$$

where $\Pi_{0,t}$ is a projector indicating an unsuccessful projection and $\Pi_{s,t}$ with $s \ne 0$ is a projector indicating a successful projection onto one of the subspaces of size given in (18.39). Alice's final processing step $\mathcal{E}_4^{YSTA^n \to YST\hat{A}}$ is to perform an isometry $U^{A^n \to \hat{A}}$ conditional on the $Y$ register being equal to one, the particular value in the $S$ register, and conditional on the type $t$, and otherwise trace out $A^n$ and replace it by some state orthogonal to all the binary numbers in the set $\{0,1\}^m$ (indicating failure). The isometry $U^{A^n \to \hat{A}}$ is a coherent version of a function $g$ that maps the sequences $x^n$ in the type class $T_t'$ to a binary number in $\{0,1\}^m$:

$$U^{A^n \to \hat{A}} \equiv \sum_{x^n \in T_t'} |g(x^n)\rangle^{\hat{A}} \langle x^n|^{A^n}. \tag{18.42}$$

We place a prime on the type class $T_t'$ because it is a set slightly smaller than $T_t$ due to the projection in (18.41). Alice then traces out the systems $Y$, $S$, and $T$, and we let $\mathcal{E}_5^{YST\hat{A} \to \hat{A}}$ denote this action. This last step completes Alice's actions, and let $\mathcal{E}^{A^n \to \hat{A}}$ denote the full action of her "entanglement concentrator":

$$\mathcal{E}^{A^n \to \hat{A}} \equiv \mathcal{E}_5^{YST\hat{A} \to \hat{A}} \circ \mathcal{E}_4^{YSTA^n \to YST\hat{A}} \circ \mathcal{E}_3^{YTA^n \to YSTA^n} \circ \mathcal{E}_2^{YA^n \to YTA^n} \circ \mathcal{E}_1^{A^n \to YA^n}. \tag{18.43}$$



We have outlined all of Alice's steps above, but what should Bob do? It turns out
that he should perform the exact same steps. Doing the same steps guarantees that he
receives the same results at every step because the initial state $|\phi\rangle^{AB}$ has the special form
in (18.27) from the Schmidt decomposition. That is, their classical registers $Y$, $S$, and $T$
are perfectly correlated at every step of the concentration protocol. We should also make
a point concerning the state after Alice and Bob complete their fourth processing step $\mathcal{E}_4$.
Conditional on the $Y$ and $S$ registers being equal to one (which occurs with very high
probability in the asymptotic limit), the state on systems $A$ and $B$ is equal to a state of $m$
ebits $|\Phi^+\rangle^{\otimes m}$, and so the protocol is successful.

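We now perform the error analysis of this protocol. Let $\omega^{AB}$ denote the state at the end
of the protocol:

\[ \omega^{AB} \equiv (\mathcal{E}^{A^n \to A} \otimes \mathcal{F}^{B^n \to B})(\phi^{A^nB^n}), \tag{18.44} \]

where $\mathcal{F}$ indicates Bob's steps that are the same as Alice's in (18.43). Consider the following
chain of inequalities:

\begin{align}
\left\| \omega^{AB} - (\Phi^+)^{\otimes m} \right\|_1
&= \left\| (\mathcal{E}^{A^n \to A} \otimes \mathcal{F}^{B^n \to B})(\phi^{A^nB^n}) - (\Phi^+)^{\otimes m} \right\|_1 \tag{18.45} \\
&\leq \Big\| (\mathcal{E}_4 \circ \mathcal{E}_3 \circ \mathcal{E}_2 \circ \mathcal{E}_1) \otimes (\mathcal{F}_4 \circ \mathcal{F}_3 \circ \mathcal{F}_2 \circ \mathcal{F}_1)(\phi^{A^nB^n}) \tag{18.46} \\
&\qquad - \sum_{s \neq 0,\, t \in \tau_\delta} \frac{1}{M} \mathrm{Tr}\{\Pi_{s,t}\Pi_t\Pi_\delta^{A^n}\phi^{A^nB^n}\}\, |1\rangle\langle 1|^Y \otimes |s,t\rangle\langle s,t|^{ST} \otimes (\Phi^+)^{\otimes m} \Big\|_1. \tag{18.47}
\end{align}

The first equality follows by definition, and the inequality follows from monotonicity of the
trace distance under CPTP maps (recall that the last operations $\mathcal{E}_5$ and $\mathcal{F}_5$ are just tracing
out). Also, $M$ is an appropriate normalization constant related to the probability of the
typical subspace and the probability of projecting correctly onto one of the blocks of size
given in (18.39):

\[ M \equiv \sum_{s \neq 0,\, t \in \tau_\delta} \mathrm{Tr}\{\Pi_{s,t}\Pi_t\Pi_\delta^{A^n}\phi^{A^nB^n}\} \geq 1 - \epsilon - \epsilon_1. \tag{18.48} \]
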



The first inequality follows from the result of Exercise 4.1.11 (note that Alice and Bob both 
have access to the registers Y, S, and T because their operations give the same results). 
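Continuing, the last term in (18.47) is bounded as

\begin{align}
&\leq \Big\| \sum_{s \neq 0,\, t \in \tau_\delta} \mathrm{Tr}\{\Pi_{s,t}\Pi_t\Pi_\delta^{A^n}\phi^{A^nB^n}\}\, |1\rangle\langle 1|^Y \otimes |s,t\rangle\langle s,t|^{ST} \otimes (\Phi^+)^{\otimes m} \notag \\
&\qquad - \sum_{s \neq 0,\, t \in \tau_\delta} \frac{1}{M} \mathrm{Tr}\{\Pi_{s,t}\Pi_t\Pi_\delta^{A^n}\phi^{A^nB^n}\}\, |1\rangle\langle 1|^Y \otimes |s,t\rangle\langle s,t|^{ST} \otimes (\Phi^+)^{\otimes m} \Big\|_1 \notag \\
&\quad + \Big\| \sum_{t} \mathrm{Tr}\{\Pi_{0,t}\Pi_t\Pi_\delta^{A^n}\phi^{A^nB^n}\}\, |1\rangle\langle 1|^Y \otimes |0\rangle\langle 0|^S \otimes |t\rangle\langle t|^T \otimes |ee\rangle\langle ee|^{AB} \Big\|_1 \notag \\
&\quad + \Big\| \mathrm{Tr}\{(I - \Pi_\delta^{A^n})\phi^{A^nB^n}\}\, |0\rangle\langle 0|^Y \otimes |0\rangle\langle 0|^S \otimes |e\rangle\langle e|^T \otimes |ee\rangle\langle ee|^{AB} \Big\|_1 \tag{18.49} \\
&\leq \left| 1 - \frac{1}{M} \right| \Big\| \sum_{s \neq 0,\, t \in \tau_\delta} \mathrm{Tr}\{\Pi_{s,t}\Pi_t\Pi_\delta^{A^n}\phi^{A^nB^n}\}\, |1\rangle\langle 1|^Y \otimes |s,t\rangle\langle s,t|^{ST} \otimes (\Phi^+)^{\otimes m} \Big\|_1 + \epsilon_1 + \epsilon \tag{18.50} \\
&\leq \epsilon_1 + \epsilon + \epsilon_1 + \epsilon \tag{18.51} \\
&= 2\epsilon_1 + 2\epsilon. \tag{18.52}
\end{align}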



The first inequality follows by noting that the successive operations $\mathcal{E}_4 \circ \mathcal{E}_3 \circ \mathcal{E}_2 \circ \mathcal{E}_1$ and
those on Bob's side break into three terms: one corresponding to a successful concentration,
an error term if the projection onto blocks of size $2^m$ fails, and another error term if the
first typical subspace projection fails. It also follows from an application of the triangle
inequality. The second inequality follows by pulling the normalization constant out of the
trace distance, by noting that the probability of the projection onto the remainder register is
no larger than $\epsilon_1$, and by applying the bound on the typical projection failing. The last few
inequalities follow from the bound in (18.48). The rate of the resulting maximally entangled
state is then

\[ \frac{1}{n}\log(2^m) = \frac{1}{n} \lfloor nH(A) - n\eta(d\delta) - d\log(n+1) + \log(1-\epsilon_1) \rfloor. \tag{18.53} \]

This rate becomes asymptotically close to the entropy of entanglement $H(A)$ in the limit
where $\delta, \epsilon_1 \to 0$ and $n \to \infty$. We have thus proven that LHS $\geq$ RHS in Theorem 18.3.1, and
we have shown the following resource inequality:

\[ \langle \phi^{AB} \rangle \geq H(A)\,[qq]. \tag{18.54} \]

That is, beginning with $n$ copies of a pure, bipartite entangled state $\phi^{AB}$, Alice and Bob can
extract $nH(A)$ ebits from it with negligible error probability in the asymptotic limit.
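
To get a feel for the rate in (18.53), the following Python sketch estimates how many ebits a type class measurement yields on average for a two-level Schmidt spectrum $(p, 1-p)$. It is an idealization rather than the protocol above: it assumes a binary alphabet, skips the typical subspace check, and ignores the binning into blocks of size $2^m$. The function names are illustrative assumptions. Conditioned on the type $k$ (the number of ones), the shared state is maximally entangled on a subspace of dimension $\binom{n}{k}$, which is worth $\log_2 \binom{n}{k}$ ebits.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy H_2(p) in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def log2_binomial(n: int, k: int) -> float:
    """log2 of C(n, k), computed with lgamma so that large n does not overflow."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(2)


def expected_ebit_rate(p: float, n: int) -> float:
    """Average of log2|T_k| / n when the type k is Binomial(n, p) distributed.

    Assumption: 0 < p < 1 and a binary Schmidt spectrum (p, 1 - p).
    """
    total = 0.0
    for k in range(n + 1):
        log2_size = log2_binomial(n, k)  # log2 of the type class size
        log2_prob = log2_size + k * math.log2(p) + (n - k) * math.log2(1 - p)
        total += 2.0 ** log2_prob * log2_size  # probability of type k times its yield
    return total / n


if __name__ == "__main__":
    p = 0.2
    for n in (10, 100, 1000):
        print(n, round(expected_ebit_rate(p, n), 4), round(binary_entropy(p), 4))
```

Under these assumptions the printed rate stays below $H_2(p)$ and approaches it as $n$ grows, which mirrors the statement that the rate in (18.53) approaches the entropy of entanglement.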
18.3.2 The Converse Theorem

We now prove the converse theorem for entanglement concentration, i.e., the inequality LHS
$\leq$ RHS in Theorem 18.3.1. Alice and Bob begin with many copies of the pure state $|\phi\rangle^{AB}$. In
the most general protocol given in Figure 18.1, they both perform local CPTP maps $\mathcal{E}^{A^n \to A}$
and $\mathcal{F}^{B^n \to B}$ to produce the following state:

\[ \omega^{AB} \equiv (\mathcal{E}^{A^n \to A} \otimes \mathcal{F}^{B^n \to B})(\phi^{A^nB^n}). \tag{18.55} \]

If the protocol is successful, then the actual state $\omega^{AB}$ is $\epsilon$-close to the ideal maximally
entangled state $\Phi^{AB}$:

\[ \left\| \omega^{AB} - \Phi^{AB} \right\|_1 \leq \epsilon. \tag{18.56} \]

Consider the following chain of inequalities:

\begin{align}
2nE &= 2H(A)_{\Phi} \tag{18.57} \\
&= H(A)_{\Phi} + H(B)_{\Phi} - H(AB)_{\Phi} \tag{18.58} \\
&= I(A;B)_{\Phi} \tag{18.59} \\
&\leq I(A;B)_{\omega} + n\epsilon' \tag{18.60} \\
&\leq I(A^n;B^n)_{\phi^{\otimes n}} + n\epsilon' \tag{18.61} \\
&= H(A^n)_{\phi^{\otimes n}} + H(B^n)_{\phi^{\otimes n}} - H(A^nB^n)_{\phi^{\otimes n}} + n\epsilon' \tag{18.62} \\
&= 2nH(A)_{\phi} + n\epsilon'. \tag{18.63}
\end{align}

The first equality follows because the entropy of entanglement $H(A)_{\Phi}$ of a maximally entangled
state $\Phi^{AB}$ is equal to the logarithm of its Schmidt rank. The next equality follows
because $H(B)_{\Phi} = H(A)_{\Phi}$ and $H(AB)_{\Phi} = 0$ for a pure bipartite entangled state (see Theorem 11.2.1).
The third equality follows from the definition of quantum mutual information.
The first inequality follows from applying the Alicki-Fannes' inequality for quantum mutual
information to (18.56) with $\epsilon' \equiv 6\epsilon\log|A| + 4H_2(\epsilon)/n$ (see Exercise 11.9.7). The second
inequality follows from quantum data processing of both $A^n$ and $B^n$. The final equalities
follow from the same arguments as the first two equalities and because the entropy of a
tensor product state is additive.
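
The equalities (18.57)-(18.59) and (18.62)-(18.63) rest on two facts about pure bipartite states: $H(AB) = 0$ and $H(A) = H(B)$, so that $I(A;B) = 2H(A)$. The sketch below checks this numerically with NumPy for the two-qubit state $\sqrt{p}|00\rangle + \sqrt{1-p}|11\rangle$. The helper names and the choice of state are assumptions made for illustration only.

```python
import numpy as np


def von_neumann_entropy(rho: np.ndarray) -> float:
    """Entropy H(rho) in bits, computed from the eigenvalues of rho."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]  # drop numerically zero eigenvalues
    return float(-np.sum(evals * np.log2(evals)))


def reduced_states(rho: np.ndarray, d_a: int, d_b: int):
    """Return (rho_A, rho_B) for a density operator rho on A tensor B."""
    r = rho.reshape(d_a, d_b, d_a, d_b)
    rho_a = np.einsum("ijkj->ik", r)  # trace out B
    rho_b = np.einsum("ijil->jl", r)  # trace out A
    return rho_a, rho_b


def mutual_information(rho: np.ndarray, d_a: int, d_b: int) -> float:
    """Quantum mutual information I(A;B) = H(A) + H(B) - H(AB) in bits."""
    rho_a, rho_b = reduced_states(rho, d_a, d_b)
    return (von_neumann_entropy(rho_a) + von_neumann_entropy(rho_b)
            - von_neumann_entropy(rho))


def schmidt_state(p: float) -> np.ndarray:
    """Density operator of sqrt(p)|00> + sqrt(1-p)|11> (illustrative choice)."""
    psi = np.zeros(4)
    psi[0], psi[3] = np.sqrt(p), np.sqrt(1 - p)
    return np.outer(psi, psi)


if __name__ == "__main__":
    rho = schmidt_state(0.2)
    rho_a, _ = reduced_states(rho, 2, 2)
    print("H(A)   =", von_neumann_entropy(rho_a))
    print("I(A;B) =", mutual_information(rho, 2, 2))
```

For $p = 1/2$ the state is a single ebit and the mutual information evaluates to two bits, matching $2nE$ with $nE = 1$ in (18.57).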



18.4 Common Randomness Concentration 

We now briefly discuss common randomness concentration, which is the closest classical 
analog to entanglement concentration. The goal of this protocol is to extract uniformly 
distributed bits from non-uniform common randomness. This discussion should help give 
insight from the classical world into the entanglement concentration protocol. Our discussion 
merely "hits the high points" of the protocol without being as detailed as in the direct part of 
the entanglement concentration theorem. Suppose that Alice and Bob begin with a correlated 



©2012 Mark M. Wilde — This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 
Unported License 






joint probability distribution $p_X(x)\delta(x,y)$ that is not necessarily maximally correlated (if it
were maximally correlated, then the distribution $p_X(x)$ would be uniform, and there would
be no need for common randomness concentration). Now suppose that they have access to
many copies of this distribution:

\[ p_{X^n}(x^n)\,\delta(x^n,y^n). \tag{18.64} \]

The steps in the protocol for common randomness concentration are similar to those in
entanglement concentration. The protocol begins with Alice recording whether her sequence
$x^n$ is in the typical set $T_\delta^{X^n}$. If the sequence is typical, she keeps it, and the resulting
probability distribution is

\[ \frac{1}{\Pr\{T_\delta^{X^n}\}}\, p_{X^n}(x^n)\,\delta(x^n,y^n), \tag{18.65} \]

where $x^n \in T_\delta^{X^n}$ and $\Pr\{T_\delta^{X^n}\} \geq 1 - \epsilon$ follows from the properties of typical sequences. The
following non-typical probability distribution occurs with probability no larger than $\epsilon$:

\[ \frac{1}{1 - \Pr\{T_\delta^{X^n}\}}\, p_{X^n}(x^n)\,\delta(x^n,y^n), \tag{18.66} \]

and she declares a failure if the sequence is not typical.

Continuing, Alice then determines the type of the probability distribution in (18.65). The
key point here (as for entanglement concentration) is that all sequences within the same type
class have the same probability of occurring. Thus, conditioned on a particular type, the
resulting sequences have a uniform distribution. The distribution resulting from determining
the type is

\[ \frac{1}{|T_t|}\,\delta(x^n,y^n), \tag{18.67} \]

where $x^n \in T_t$. The above probability distribution is then a uniform distribution of size at
least

\[ \frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{nH(t)}, \tag{18.68} \]

by applying the bound from the proof of Property 13.7.5. This size is slightly larger than
the following size

\[ \frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{n(H(X) - \eta(|\mathcal{X}|\delta))}, \tag{18.69} \]

because $t$ is a typical type. Bob performs the exact same steps, and they both then perform
local transformations of the data to convert their sequences to uniform random bits. The
rate of the resulting uniform bit extraction is then

\[ \frac{1}{n}\log\left( \frac{1}{(n+1)^{|\mathcal{X}|}}\, 2^{n(H(X) - \eta(|\mathcal{X}|\delta))} \right) = H(X) - \eta(|\mathcal{X}|\delta) - \frac{1}{n}|\mathcal{X}|\log(n+1). \tag{18.70} \]

This rate becomes asymptotically close to the entropy $H(X)$ in the limit where $\delta \to 0$ and
$n \to \infty$.
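
The type class size bound used in (18.68) can be checked directly for a binary alphabet, where the type class with $k$ ones has exactly $\binom{n}{k}$ sequences. The sketch below is only an illustration under that assumption ($|\mathcal{X}| = 2$); the function names are hypothetical. It works with base-2 logarithms so that the comparison stays in floating-point range.

```python
import math


def binary_entropy(q: float) -> float:
    """Binary entropy H_2(q) in bits."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)


def log2_type_class_bounds(n: int, k: int):
    """Return (log2 lower bound, log2 |T_t|, log2 upper bound) for a binary type.

    Lower bound: 2^{n H(t)} / (n + 1)^{|X|} with |X| = 2, as in (18.68).
    Upper bound: 2^{n H(t)}.
    """
    n_h = n * binary_entropy(k / n)
    log2_size = math.log2(math.comb(n, k))  # exact type class size
    return n_h - 2 * math.log2(n + 1), log2_size, n_h


if __name__ == "__main__":
    for n in (10, 100, 1000):
        k = n // 5  # empirical distribution (0.8, 0.2)
        lower, size, upper = log2_type_class_bounds(n, k)
        # Per-symbol rates: the middle value approaches H_2(0.2) as n grows.
        print(n, round(lower / n, 4), round(size / n, 4), round(upper / n, 4))
```

The gap between the exact size and the lower bound is at most $|\mathcal{X}|\log(n+1)$ bits, which is the polynomial correction that vanishes in the rate (18.70).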
18.5 Schumacher Compression versus Entanglement Concentration

The tasks of Schumacher compression from the previous chapter and entanglement con- 
centration from the current chapter might seem as if they should be related — Schumacher 
compression compresses the output of a quantum information source to its fundamental 
limit of compressibility, and entanglement concentration concentrates entanglement to its 
fundamental limit. A natural question to ask is whether Alice and Bob can exploit the 
technique for Schumacher compression to accomplish the information processing task of en- 
tanglement concentration, and vice versa, can they exploit entanglement concentration to 
accomplish Schumacher compression? Interestingly, the answer is "no" to both questions, 
and in fact, exploiting one protocol to accomplish the other's information processing task 
performs remarkably poorly. This has to do with the differences between a typical subspace
measurement and a type class subspace measurement, which are the building blocks
of Schumacher compression and entanglement concentration, respectively. 

First, let us consider exploiting the entanglement concentration protocol to accomplish 
the goal of Schumacher compression. The method of entanglement concentration exploits 
a type class subspace projection, and it turns out that such a measurement is just too 
aggressive (disturbs the state too much) to perform well at Schumacher compression. In 
fact, the protocol for entanglement concentration is so poor at accomplishing the goal of 
Schumacher compression that the trace distance between the projected state and the original 
state reaches its maximum value in the asymptotic limit, implying that the two states are 
perfectly distinguishable! We state this result as the following theorem. 

Theorem 18.5.1. Suppose that the type class subspace measurement projects onto an empirical
distribution slightly different from the distribution $p_X(x)$ in the spectral decomposition
of a state $\rho$. Then the trace distance between a concentrated state and the original state
approaches its maximum value of two in the asymptotic limit. Furthermore, even if the
empirical distribution of the projected type is the same as $p_X(x)$, the projected state never
becomes asymptotically close to the tensor power state $\rho^{\otimes n}$.

Proof. Consider that the first action of entanglement concentration is to perform a typical
subspace measurement followed by a type measurement. These actions result in some state
of the following form:

\[ \frac{1}{\mathrm{Tr}\{\Pi_t \rho^{\otimes n}\}}\, \Pi_t \rho^{\otimes n} \Pi_t, \tag{18.71} \]

where $t$ is a typical type and the basis for the type class projectors is the eigenbasis of
$\rho = \sum_x p_X(x)|x\rangle\langle x|$. Also, consider that we can write the original state $\rho^{\otimes n}$ as follows:

\[ \rho^{\otimes n} = \sum_{t'} \Pi_{t'} \rho^{\otimes n} = \sum_{t'} \Pi_{t'} \rho^{\otimes n} \Pi_{t'}. \tag{18.72} \]

The above equality holds because $\sum_{t'} \Pi_{t'} = I$ and because $\rho^{\otimes n}$ commutes with any type
projector $\Pi_{t'}$. We would now like to show that the trace distance between the state in
(18.71) and $\rho^{\otimes n}$ asymptotically converges to its maximum value of two when $t \neq p_X$, implying
that entanglement concentration is a particularly bad method for Schumacher compression.
Consider the following chain of inequalities:

\begin{align}
\left\| \frac{1}{\mathrm{Tr}\{\Pi_t \rho^{\otimes n}\}} \Pi_t \rho^{\otimes n} \Pi_t - \rho^{\otimes n} \right\|_1
&= \left\| \frac{1}{\mathrm{Tr}\{\Pi_t \rho^{\otimes n}\}} \Pi_t \rho^{\otimes n} \Pi_t - \sum_{t'} \Pi_{t'} \rho^{\otimes n} \Pi_{t'} \right\|_1 \tag{18.73} \\
&= \left\| \left( \frac{1}{\mathrm{Tr}\{\Pi_t \rho^{\otimes n}\}} - 1 \right) \Pi_t \rho^{\otimes n} \Pi_t - \sum_{t' \neq t} \Pi_{t'} \rho^{\otimes n} \Pi_{t'} \right\|_1 \tag{18.74} \\
&= \left| \frac{1}{\mathrm{Tr}\{\Pi_t \rho^{\otimes n}\}} - 1 \right| \left\| \Pi_t \rho^{\otimes n} \Pi_t \right\|_1 + \left\| \sum_{t' \neq t} \Pi_{t'} \rho^{\otimes n} \Pi_{t'} \right\|_1 \tag{18.75} \\
&= \left| \frac{1}{\mathrm{Tr}\{\Pi_t \rho^{\otimes n}\}} - 1 \right| \mathrm{Tr}\{\Pi_t \rho^{\otimes n}\} + \sum_{t' \neq t} \mathrm{Tr}\{\Pi_{t'} \rho^{\otimes n}\} \tag{18.76} \\
&= \left| 1 - \mathrm{Tr}\{\Pi_t \rho^{\otimes n}\} \right| + \sum_{t'} \mathrm{Tr}\{\Pi_{t'} \rho^{\otimes n}\} - \mathrm{Tr}\{\Pi_t \rho^{\otimes n}\} \tag{18.77} \\
&= 2\left(1 - \mathrm{Tr}\{\Pi_t \rho^{\otimes n}\}\right) \tag{18.78} \\
&\geq 2\left(1 - 2^{-nD(t/n \| p)}\right). \tag{18.79}
\end{align}

The first equality follows from (18.72). The second equality follows from straightforward
algebra. The third equality follows because all of the type class subspaces are orthogonal to
each other. The fourth equality follows because the operators $\Pi_{t'} \rho^{\otimes n} \Pi_{t'}$ are positive for all
types $t'$. The last few equalities are straightforward, and the final inequality follows from the
bound in Exercise 13.7.1. This shows that the state from the entanglement concentration is
a very poor approximation in the asymptotic limit when the type distribution $t/n$ is different
from the distribution $p$ from the spectral decomposition of $\rho$, implying a positive relative
entropy $D(t/n \| p)$. On the other hand, suppose that the empirical distribution $t/n$ is the same
as the distribution $p$. Then, we can rewrite $2(1 - \mathrm{Tr}\{\Pi_t \rho^{\otimes n}\})$ as

\[ 2\left(1 - \mathrm{Tr}\{\Pi_t \rho^{\otimes n}\}\right) = 2 \sum_{t' \neq t} \mathrm{Tr}\{\Pi_{t'} \rho^{\otimes n}\}. \tag{18.80} \]

The resulting expression includes the probability mass from every type class besides $t$, and
the probability mass of the type class $t$ alone can never approach one asymptotically. It is a
subspace smaller than the typical subspace because it does not include all of the other typical
types, and such a result would thus contradict the optimality of Schumacher compression.
Thus, this approximation is also poor in the asymptotic limit. □
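
As a numerical illustration of (18.78) and (18.79), the sketch below evaluates the trace distance $2(1 - \mathrm{Tr}\{\Pi_t \rho^{\otimes n}\})$ for a qubit state with spectrum $(1-p, p)$, where the type class with $k$ ones has probability mass $\binom{n}{k} p^k (1-p)^{n-k}$. The binary spectrum and the function names are assumptions chosen to keep the example small.

```python
import math


def type_class_mass(n: int, k: int, p: float) -> float:
    """Tr{Pi_t rho^{(x)n}} for a binary spectrum: C(n, k) p^k (1 - p)^(n - k)."""
    log_mass = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_mass)


def trace_distance_to_tensor_power(n: int, k: int, p: float) -> float:
    """Right-hand side of (18.78): 2 (1 - Tr{Pi_t rho^{(x)n}})."""
    return 2.0 * (1.0 - type_class_mass(n, k, p))


def relative_entropy_bits(q: float, p: float) -> float:
    """D(q || p) in bits for two Bernoulli distributions with 0 < q, p < 1."""
    return q * math.log2(q / p) + (1 - q) * math.log2((1 - q) / (1 - p))


if __name__ == "__main__":
    p = 0.3
    for n in (10, 100, 1000):
        matched = trace_distance_to_tensor_power(n, round(p * n), p)  # t/n = p
        mismatched = trace_distance_to_tensor_power(n, n // 2, p)     # t/n = 1/2
        print(n, round(matched, 4), round(mismatched, 6))
```

In this toy setting the mismatched type approaches the maximum distance of two exponentially fast, while the matched type approaches it only polynomially, since a single type class carries a probability mass on the order of $1/\sqrt{n}$.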
We also cannot use the technique from Schumacher compression (a typical subspace
measurement) to perform entanglement concentration. It seems like it could be possible,
given that the eigenvalues of the state resulting from a typical subspace measurement are
approximately uniform (recall the third property of typical subspaces, Property 14.1.3). That is,
suppose that Alice and Bob share many copies of a state $|\phi\rangle^{AB}$ with Schmidt decomposition
$|\phi\rangle^{AB} = \sum_x \sqrt{p_X(x)}\,|x\rangle^A|x\rangle^B$:

\[ \left(|\phi\rangle^{AB}\right)^{\otimes n} = \sum_{x^n \in \mathcal{X}^n} \sqrt{p_{X^n}(x^n)}\,|x^n\rangle^{A^n}|x^n\rangle^{B^n}. \tag{18.81} \]

Then a projection onto the typical subspace succeeds with high probability and results in a
state of the following form:

\[ \frac{1}{\sqrt{\sum_{x^n \in T_\delta^{X^n}} p_{X^n}(x^n)}} \sum_{x^n \in T_\delta^{X^n}} \sqrt{p_{X^n}(x^n)}\,|x^n\rangle^{A^n}|x^n\rangle^{B^n}. \tag{18.82} \]

Consider the following maximally entangled state on the typical subspace:

\[ \frac{1}{\sqrt{|T_\delta^{X^n}|}} \sum_{x^n \in T_\delta^{X^n}} |x^n\rangle^{A^n}|x^n\rangle^{B^n}. \tag{18.83} \]

It seems like the state in (18.82) should be approximately close to the maximally entangled
state on the typical subspace because the probability amplitudes of the state in (18.82) are
all $\approx 2^{-nH(X)}$. But we can provide a simple counterexample to prove that this is not true.
Suppose that we have the following two pure, bipartite states:

\[ |\phi\rangle^{AB} \equiv \sqrt{p}\,|00\rangle^{AB} + \sqrt{1-p}\,|11\rangle^{AB}, \tag{18.84} \]

\[ |\psi\rangle^{AB} \equiv \sqrt{1-p}\,|00\rangle^{AB} + \sqrt{p}\,|11\rangle^{AB}, \tag{18.85} \]

where $p$ is some real number strictly between zero and one and not equal to 1/2. Consider
that the fidelity between these two states is equal to $2\sqrt{p(1-p)}$, and observe that
$2\sqrt{p(1-p)} < 1$ for the values of $p$ given above. Thus, the fidelity between $n$ copies of these
states is equal to $(2\sqrt{p(1-p)})^n$ and approaches zero in the asymptotic limit for the values of
$p$ given above. Suppose that Alice performs a typical subspace measurement on many copies
of the state $|\psi\rangle^{AB}$, and let $\psi'$ denote the resulting state (it is a state of the form in (18.82)).
Let $\Phi$ denote the maximally entangled state on the typical subspace of $(|\psi\rangle^{AB})^{\otimes n}$. Now
suppose for a contradiction that the trace distance between the maximally entangled state $\Phi$
and the projected state $\psi'$ becomes small as $n$ becomes large (i.e., suppose that Schumacher
compression performs well for the task of entanglement concentration). Also, consider that
the typical subspace of $(|\phi\rangle^{AB})^{\otimes n}$ is the same as the typical subspace of $(|\psi\rangle^{AB})^{\otimes n}$ if we
employ a typical subspace measurement with respect to entropic typicality (weak typicality).
Let $\phi'$ denote the typically projected state. We can then apply the result of Exercise 9.2.3
twice and the triangle inequality to bound the fidelity between the maximally entangled
state $\Phi$ and the state $\phi'$:
\begin{align}
F(\phi', \Phi) &\leq F(\phi', \psi^{\otimes n}) + \left\| \Phi - \psi^{\otimes n} \right\|_1 \tag{18.86} \\
&\leq F(\phi^{\otimes n}, \psi^{\otimes n}) + \left\| \phi' - \phi^{\otimes n} \right\|_1 + \left\| \psi^{\otimes n} - \psi' \right\|_1 + \left\| \Phi - \psi' \right\|_1 \tag{18.87} \\
&\leq \left(2\sqrt{p(1-p)}\right)^n + 2\sqrt{\epsilon} + 2\sqrt{\epsilon} + \epsilon. \tag{18.88}
\end{align}

The second inequality follows because a typical subspace measurement succeeds with
probability $1 - \epsilon$ and the Gentle Measurement Lemma (Lemma 9.4.1) implies that the trace
distances $\|\phi' - \phi^{\otimes n}\|_1$ and $\|\psi^{\otimes n} - \psi'\|_1$ are each less than $2\sqrt{\epsilon}$. The bound on $\|\Phi - \psi'\|_1$ follows
from the assumption that Schumacher compression is successful at entanglement concentration.
Then taking $n$ to be sufficiently large guarantees that the fidelity $F(\phi', \Phi)$ becomes
arbitrarily small. But this result contradicts the assumption that Schumacher compression
is successful at entanglement concentration because the typical subspace measurement is
the same for the state $\phi^{\otimes n}$. That is, we are led to a contradiction because $\phi' = \psi'$ when
the typical subspace measurement is with respect to entropic typicality. Thus, the trace
distance $\|\Phi - \psi'\|_1$ cannot become arbitrarily small for large $n$, implying that Schumacher
compression does not perform well for the task of entanglement concentration.
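
The first term in (18.88) is easy to check numerically. The sketch below builds the two states in (18.84) and (18.85) as vectors, forms their $n$-fold tensor powers with `numpy.kron`, and compares the overlap with $(2\sqrt{p(1-p)})^n$. The helper names are illustrative assumptions, and the brute-force tensor power is only practical for small $n$ because the dimension grows as $4^n$.

```python
import numpy as np


def two_qubit_state(a: float, b: float) -> np.ndarray:
    """Vector a|00> + b|11> in the computational basis."""
    v = np.zeros(4)
    v[0], v[3] = a, b
    return v


def tensor_power(v: np.ndarray, n: int) -> np.ndarray:
    """n-fold tensor power of the vector v (dimension len(v) ** n)."""
    out = np.array([1.0])
    for _ in range(n):
        out = np.kron(out, v)
    return out


def overlap(u: np.ndarray, v: np.ndarray) -> float:
    """Absolute value of the inner product <u|v>."""
    return float(abs(np.vdot(u, v)))


if __name__ == "__main__":
    p = 0.1
    phi = two_qubit_state(np.sqrt(p), np.sqrt(1 - p))
    psi = two_qubit_state(np.sqrt(1 - p), np.sqrt(p))
    for n in (1, 2, 4, 6):
        direct = overlap(tensor_power(phi, n), tensor_power(psi, n))
        formula = (2 * np.sqrt(p * (1 - p))) ** n
        print(n, direct, formula)
```

The overlap decays exponentially in $n$ whenever $p \neq 1/2$, which is the reason the fidelity term in (18.88) vanishes in the asymptotic limit.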

18.6 Concluding Remarks 

Entanglement concentration was one of the earliest discovered protocols in quantum Shannon 
theory. The protocol exploits one of the fundamental tools of classical information theory 
(the method of types), but it applies the method in a coherent fashion so that a type class 
measurement learns only the type and nothing more. The protocol is similar to Schumacher 
compression in this regard (in that it learns only the necessary information required to 
execute the protocol and preserves coherent superpositions), and we will continue to see this 
idea of applying classical techniques in a coherent way in future quantum Shannon-theoretic 
protocols. For example, the protocol for quantum communication over a quantum channel 
is a coherent version of a protocol to transmit private classical information over a quantum 
channel. Despite the similarity of entanglement concentration to Schumacher compression 
in the aforementioned regard, the protocols are fundamentally different, leading to a failure 
of the intended information processing task if one protocol is exploited to accomplish the 
information processing task of the other. 

18.7 History and Further Reading 

Elias constructed a protocol for randomness concentration in an early paper [90]. Bennett et
al. offered two different protocols for entanglement concentration (one of which we developed
in this chapter) [21]. Nielsen later connected entanglement concentration protocols to the
theory of majorization [195]. Lo and Popescu studied entanglement concentration and the
classical communication cost of the inverse protocol (entanglement dilution) [188, 187]. Hayden
and Winter further elaborated on the communication cost of entanglement dilution [135],
as did Harrow and Lo [124]. Kaye and Mosca developed practical networks for entanglement
concentration [169], and recently, Blume-Kohout et al. took this line of research a step further
by considering streaming protocols for entanglement concentration [40]. Hayashi and
Matsumoto also developed protocols for universal entanglement concentration [129].
Before quantum information theory became an established discipline in its own right,
John R. Pierce issued the following quip at the end of his 1973 retrospective article on the
history of information theory [204]:

"I think that I have never met a physicist who understood information theory. I
wish that physicists would stop talking about reformulating information theory 
and would give us a general expression for the capacity of a channel with quantum 
effects taken into account rather than a number of special cases." 

Since the publication of Pierce's article, we have learned much more about quantum 
mechanics and information theory than he might have imagined at the time, but we have 
also realized that there is much more to discover. In spite of all that we have learned, we 
still unfortunately have not been able to address Pierce's concern in the above quote in full 
generality. 

The most basic question that we could ask in quantum Shannon theory (and the one 
with which Pierce was concerned) is how much classical information can a sender transmit 
to a receiver by exploiting a quantum channel. We have determined many special cases of 
quantum channels for which we do know their classical capacities, but we also now know 
that this most basic question is still wide open in the general case. 

What Pierce may not have imagined at the time is that a quantum channel has a much 
larger variety of capacities than does a classical channel. For example, we might wish to de- 
termine the classical capacity of a quantum channel assisted by entanglement shared between 
the sender and receiver. We have seen that in the simplest of cases, such as the noiseless 
qubit channel, shared entanglement boosts the classical capacity up to two bits, and we now 
refer to this phenomenon as the super-dense coding effect (see Chapter 6). Interestingly, the
entanglement-assisted capacity of a quantum channel is one of the few scenarios where we 
can claim to have a complete understanding of the channel's transmission capabilities. From 
the results regarding the entanglement-assisted capacity, we have learned that shared entan- 
glement is often a "friend" because it tends to simplify results in both quantum Shannon 
theory and other subfields of quantum information science. 

Additionally, we might consider the capacity of a quantum channel for transmitting quan- 
tum information. In 1973, it was not even clear what was meant by "quantum information," 
but we have since been able to formulate what it means, and we have been able to char- 
acterize the quantum capacity of a quantum channel. The task of transmitting quantum 
information over a quantum channel bears some similarities with the task of transmitting 
private classical information over that channel, where we are concerned with keeping the clas- 
sical information private from the environment of the channel. This connection has given 
insight for achieving good rates of quantum communication over a noisy quantum channel, 
and there is even a certain class of channels for which we already have a good expression 
for the quantum capacity (the expression being the coherent information of the channel). 
Though, the problem of determining a good expression for the quantum capacity in the 
general case is still wide open. 

The remaining chapters of the book are an attempt to summarize many items the quan- 
tum information community has learned in the past few decades, all of which are an attempt 
to address Pierce's concern in various ways. The most important open problem in quantum
Shannon theory is to find better expressions for these capacities so that we can actually 
compute them for an arbitrary quantum channel. 
Classical Communication



This chapter begins our exploration of "dynamic" information processing tasks in quan- 
tum Shannon theory, where the term "dynamic" indicates that a quantum channel connects 
a sender to a receiver and their goal is to exploit this resource for communication. We 
specifically consider the scenario where a sender Alice would like to communicate classical 
information to a receiver Bob, and the capacity theorem that we prove here is one particular 
generalization of Shannon's noisy channel coding theorem from classical information theory
(overviewed in Section 2.2). In later chapters, we will see other generalizations of Shannon's
theorem, depending on what resources are available to assist their communication or depend- 
ing on whether they are trying to communicate classical or quantum information. For this 
reason and others, quantum Shannon theory is quite a bit richer than classical information 
theory. 

The naive approach to communicate classical information over a quantum channel is for 
Alice and Bob simply to mimic the approach used in Shannon's noisy channel coding theorem. 
That is, they select a random classical code according to some distribution $p_X(x)$, and Bob
performs individual measurements of the outputs of a noisy quantum channel according to
some POVM. The POVM at the output induces some conditional probability distribution
$p_{Y|X}(y|x)$, which we can in turn think of as an induced noisy classical channel. The classical
mutual information $I(X;Y)$ of this channel is an achievable rate for communication, and
the best strategy for Alice and Bob is to optimize the mutual information over all of Alice's
inputs to the channel and over all measurements that Bob could perform at the output.
The resulting quantity is equivalent to Bob's optimized accessible information, which we
previously discussed in Section 10.8.2.



If the aforementioned coding strategy were optimal, then there would not be anything
much interesting to say for the information processing task of classical communication (in
fact, there would not be any need for all of the tools we developed in Chapters 14 and 15).
This is perhaps one first clue that the above strategy is not necessarily optimal. Furthermore,
we know from Chapter 11 that the Holevo information is an upper bound to the accessible
information, and this bound might prompt us to wonder if it is also an achievable rate for
classical communication, given that the accessible information is achievable.
The main theorem of this chapter is the classical capacity theorem (also known as the
Holevo-Schumacher-Westmoreland theorem), and it states that the Holevo information of a
quantum channel is an achievable rate for classical communication. The Holevo information 
is easier to manipulate mathematically than is the accessible information. The proof of its 
achievability demonstrates that the aforementioned strategy is not optimal, and the proof 
also shows how performing collective measurements over all of the channel outputs allows 
the sender and receiver to achieve the Holevo information as a rate for classical communi- 
cation. Thus, this strategy fundamentally makes use of quantum-mechanical effects at the 
decoder and suggests that such an approach is necessary to achieve the Holevo information. 
Although this strategy exploits collective measurements at the decoder, it does not make use 
of entangled states at the encoder. That is, the sender could input quantum states that are 
entangled across all of the channel inputs, and this encoder entanglement might potentially 
increase classical communication rates. 

One major drawback of the classical capacity theorem (also the case for many other 
results in quantum Shannon theory) is that it only demonstrates that the Holevo information 
is an achievable rate for classical communication — the converse theorem is a "multi-letter" 
converse, meaning that it might be necessary in the general case to evaluate the Holevo 
information over a potentially infinite number of uses of the channel. The multi-letter nature 
of the capacity theorem implies that the optimization task for general channels is intractable 
and thus further implies that we know very little about the actual classical capacity of general 
quantum channels. Now, there are many natural quantum channels such as the depolarizing 
channel and the dephasing channel for which the classical capacity is known (the Holevo 
information becomes "single-letter" for these channels), and these results imply that we 
have a complete understanding of the classical information transmission capabilities of these 
channels. All of these results have to do with the additivity of the Holevo information of a
quantum channel, which we studied previously in Chapter 12.

We mentioned that the Holevo-Schumacher-Westmoreland coding strategy does not make
use of entangled inputs at the encoder. But a natural question is to wonder whether en- 
tanglement at the encoder could boost classical information transmission rates, given that 
it is a resource for many quantum protocols. This question was known as the additivity 
conjecture and went unsolved for many years, but recently Hastings offered a proof that 
entangled inputs can increase communication rates for certain channels. Thus, for these 
channels, the single-letter Holevo information is not the proper characterization of classical 
capacity (though this is not to say that there could not be some alternate characterization of the
classical capacity other than the Holevo information which would be single-letter). These
recent results demonstrate that we still know little about classical communication in the 
general case and furthermore that quantum Shannon theory is an active area of research. 

We structure this chapter as follows. We first discuss the aforementioned naive strategy
in detail, so that we can understand the difference between it and the
Holevo-Schumacher-Westmoreland strategy. Section 19.2 describes the steps needed in any protocol for classical
communication over a quantum channel. Section 19.3 provides a statement of the classical
capacity theorem, and its two subsections prove the corresponding direct coding theorem and
the converse theorem. The direct coding theorem exploits two tools: quantum typicality from
Chapter 14 and the packing lemma from Chapter 15. The converse theorem exploits two tools
from Chapter 11: continuity of entropies (the Alicki-Fannes' inequality) and the quantum
data processing inequality. We then detail how to calculate the classical capacity of several
exemplary channels such as entanglement-breaking channels, quantum Hadamard channels,
and depolarizing channels; these are channels for which we have a complete understanding of
their classical capacity. Finally, we end with a discussion of the recent proof that the Holevo
information can be superadditive (that is, entangled inputs at the encoder can enhance
classical communication rates for some channels).

Figure 19.1: The most naive strategy for Alice and Bob to communicate classical information over many
independent uses of a quantum channel. Alice wishes to send some message M and selects some tensor
product state to input to the channel conditional on the message M. She transmits the codeword over the
channel, and Bob then receives a noisy version of it. He performs individual measurements of his quantum
systems and produces some estimate M' of the original message M. This scheme is effectively a classical
scheme because it makes no use of quantum-mechanical features such as entanglement.



19.1 Naive Approach: Product Measurements at the 
Decoder 



We begin by discussing in more detail the most naive strategy that a sender and receiver can 
exploit for the transmission of classical information over many uses of a quantum channel. 
Figure 19.1 depicts this naive approach. This first approach mimics certain features of 
Shannon's classical approach without making any use of quantum-mechanical effects. Alice 
and Bob agree on a codebook beforehand, where each classical codeword $x^n(m)$ in the
codebook corresponds to some message $m$ that Alice wishes to transmit. Alice can exploit
some alphabet $\{\rho_x\}$ of density operators to act as input to the quantum channel. That is,
the quantum codewords are of the form



$$\rho_{x^n(m)} = \rho_{x_1(m)} \otimes \rho_{x_2(m)} \otimes \cdots \otimes \rho_{x_n(m)}. \tag{19.1}$$

Figure 19.2: A coding strategy that can outperform the previous naive strategy, simply by making use of
entanglement at the encoder and decoder.



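Bob then performs individual measurements of the outputs of the quantum channel by
exploiting some POVM $\{\Lambda_y\}$. This scheme induces the following conditional probability
distribution:

\begin{align}
p_{Y_1 \cdots Y_n | X_1 \cdots X_n}(y_1 \cdots y_n \,|\, x_1(m) \cdots x_n(m))
&= \text{Tr}\{(\Lambda_{y_1} \otimes \cdots \otimes \Lambda_{y_n}) (\mathcal{N} \otimes \cdots \otimes \mathcal{N})(\rho_{x_1(m)} \otimes \cdots \otimes \rho_{x_n(m)})\} \tag{19.2} \\
&= \text{Tr}\{(\Lambda_{y_1} \otimes \cdots \otimes \Lambda_{y_n}) (\mathcal{N}(\rho_{x_1(m)}) \otimes \cdots \otimes \mathcal{N}(\rho_{x_n(m)}))\} \tag{19.3} \\
&= \prod_{i=1}^{n} \text{Tr}\{\Lambda_{y_i} \mathcal{N}(\rho_{x_i(m)})\}, \tag{19.4}
\end{align}

which we immediately recognize as many independent and identically distributed instances of
the following classical channel:

$$p_{Y|X}(y|x) \equiv \text{Tr}\{\mathcal{N}(\rho_x) \Lambda_y\}. \tag{19.5}$$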



Thus, if they exploit this scheme, the optimal rate at which they can communicate is
equivalent to the following expression:

$$I_{\text{acc}}(\mathcal{N}) \equiv \max_{\{p_X(x), \rho_x, \Lambda\}} I(X;Y), \tag{19.6}$$



where the maximization of the classical mutual information is over all input distributions, 
all input density operators, and all POVMs that Bob could perform at the output of the 
channel. This information quantity is known as the accessible information of the channel. 
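
As a small numerical illustration of (19.5) and (19.6), the following sketch computes the induced classical channel $p_{Y|X}(y|x)$ and the mutual information $I(X;Y)$ for one fixed choice of inputs and measurement. The qubit depolarizing channel with parameter 0.2, the computational-basis input states, the computational-basis measurement, and all function names are assumptions chosen for illustration. Because no maximization is performed, the resulting number is only a lower bound on the accessible information of that channel.

```python
import math

def mat_mul(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace(a):
    """Trace of a 2x2 matrix."""
    return a[0][0] + a[1][1]

def depolarize(rho, p):
    """Qubit depolarizing channel (1 - p) * rho + p * I/2 for a trace-one rho."""
    return [[(1 - p) * rho[i][j] + (p / 2 if i == j else 0.0) for j in range(2)]
            for i in range(2)]

def induced_channel(states, povm, p):
    """Table of p(y|x) = Tr{ N(rho_x) Lambda_y }, rows indexed by x."""
    return [[trace(mat_mul(depolarize(rho, p), lam)) for lam in povm]
            for rho in states]

def mutual_information(p_x, p_y_given_x):
    """Classical mutual information I(X;Y) in bits."""
    n_y = len(p_y_given_x[0])
    p_y = [sum(p_x[x] * p_y_given_x[x][y] for x in range(len(p_x)))
           for y in range(n_y)]
    total = 0.0
    for x, px in enumerate(p_x):
        for y in range(n_y):
            joint = px * p_y_given_x[x][y]
            if joint > 0:
                total += joint * math.log2(joint / (px * p_y[y]))
    return total

# Hypothetical choices: computational-basis states and measurement.
ZERO = [[1.0, 0.0], [0.0, 0.0]]
ONE = [[0.0, 0.0], [0.0, 1.0]]

channel_table = induced_channel([ZERO, ONE], [ZERO, ONE], p=0.2)
rate = mutual_information([0.5, 0.5], channel_table)
print(channel_table, rate)
```

For these choices the induced channel is a binary symmetric channel with crossover probability 0.1, so the printed rate is one minus the binary entropy of 0.1.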

The above strategy is not necessarily an optimal strategy if the channel is truly a quantum
channel because it does not make use of any quantum effects such as entanglement. A first simple
modification of the protocol to allow for such effects would be to consider coding for the
tensor product channel $\mathcal{N} \otimes \mathcal{N}$ rather than the original channel. The input states would be
entangled across two channel uses, and the output measurements would be over two channel
outputs at a time. In this way, they would be exploiting entangled states at the encoder and
collective measurements at the decoder. Figure 19.2 illustrates the modified protocol, and the
rate of classical communication that they can achieve with such a strategy is $\frac{1}{2} I_{\text{acc}}(\mathcal{N} \otimes \mathcal{N})$.
This quantity is always at least as large as $I_{\text{acc}}(\mathcal{N})$ because a special case of the strategy for
the tensor product channel $\mathcal{N} \otimes \mathcal{N}$ is to choose the distribution $p_X(x)$, the states $\rho_x$, and the
POVM $\Lambda$ to be tensor products of the ones that maximize $I_{\text{acc}}(\mathcal{N})$. We can then extend this
construction inductively by forming codes for the tensor product channel $\mathcal{N}^{\otimes k}$ (where $k$ is
a positive integer), and this extended strategy achieves the classical communication rate of
$\frac{1}{k} I_{\text{acc}}(\mathcal{N}^{\otimes k})$ for any finite $k$. These results then suggest that the ultimate classical capacity
of the channel is the regularization of the accessible information of the channel:

$$I_{\text{reg,acc}}(\mathcal{N}) \equiv \lim_{k \to \infty} \frac{1}{k} I_{\text{acc}}(\mathcal{N}^{\otimes k}). \tag{19.7}$$

The regularization of the accessible information is intractable for general quantum
channels, but the optimization task could simplify immensely if the accessible information is
additive (additive in the sense of Chapter 12). In this case, the regularized accessible
information $I_{\text{reg,acc}}(\mathcal{N})$ would be equivalent to the accessible information $I_{\text{acc}}(\mathcal{N})$. Though, even
if the quantity is additive, the optimization could still be difficult to perform in practice.

A simple upper bound on the accessible information is the Holevo information $\chi(\mathcal{N})$ of the
channel, defined as

$$\chi(\mathcal{N}) \equiv \max_{\rho} I(X;B), \tag{19.8}$$

where the maximization is over classical-quantum states $\rho^{XB}$ of the following form:

$$\rho^{XB} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \mathcal{N}^{A' \to B}(\rho_x^{A'}). \tag{19.9}$$

The Holevo information is a more desirable quantity to characterize classical communication
over a quantum channel because it is always an upper bound on the accessible information
and because Theorem 12.3.2 states that it is sufficient to consider pure states $\psi_x^{A'}$ at the
channel input for maximizing the Holevo information.

Thus, a natural question to ask is whether Alice and Bob can achieve the Holevo in- 
formation rate, and the main theorem of this chapter states that it is possible to do so. 
The resulting coding scheme bears some similarities with the techniques in Shannon's noisy 
channel coding theorem, but the main difference is that the decoding POVM is a collective 
measurement over all of the channel outputs. 
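
The quantity $I(X;B)$ in (19.8) can be evaluated directly for a fixed ensemble as $H(\sum_x p_X(x) \mathcal{N}(\rho_x)) - \sum_x p_X(x) H(\mathcal{N}(\rho_x))$. The following sketch does this for qubits only, using the closed-form eigenvalues of a 2x2 density matrix. The depolarizing channel, the two example ensembles, and the function names are assumptions for illustration. The sketch does not perform the maximization in (19.8), so each value is a lower bound on $\chi(\mathcal{N})$ for the chosen channel.

```python
import math

def entropy_2x2(rho):
    """von Neumann entropy (bits) of a trace-one 2x2 density matrix."""
    det = (rho[0][0] * rho[1][1] - rho[0][1] * rho[1][0]).real
    gap = math.sqrt(max(0.0, 1.0 - 4.0 * det))
    eigenvalues = [(1.0 + gap) / 2.0, (1.0 - gap) / 2.0]
    return -sum(v * math.log2(v) for v in eigenvalues if v > 1e-15)

def depolarize(rho, p):
    """Qubit depolarizing channel (1 - p) * rho + p * I/2 for a trace-one rho."""
    return [[(1 - p) * rho[i][j] + (p / 2 if i == j else 0.0) for j in range(2)]
            for i in range(2)]

def holevo_information(probs, states, channel):
    """I(X;B) for the ensemble {probs[x], states[x]} sent through `channel`."""
    outputs = [channel(rho) for rho in states]
    average = [[sum(q * out[i][j] for q, out in zip(probs, outputs))
                for j in range(2)] for i in range(2)]
    conditional = sum(q * entropy_2x2(out) for q, out in zip(probs, outputs))
    return entropy_2x2(average) - conditional

# Hypothetical example states: |0>, |1>, and |+>.
ZERO = [[1.0, 0.0], [0.0, 0.0]]
ONE = [[0.0, 0.0], [0.0, 1.0]]
PLUS = [[0.5, 0.5], [0.5, 0.5]]

def noisy(rho):
    """Depolarizing channel with the assumed parameter p = 0.2."""
    return depolarize(rho, 0.2)

chi_orthogonal = holevo_information([0.5, 0.5], [ZERO, ONE], noisy)
chi_nonorthogonal = holevo_information([0.5, 0.5], [ZERO, PLUS], noisy)
print(chi_orthogonal, chi_nonorthogonal)
```

In this example the orthogonal ensemble gives the larger value, which is consistent with the later discussion of the depolarizing channel.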

19.2 The Information Processing Task 

19.2.1 Classical Communication 

We now discuss the most general form of the information processing task and give the
criterion for a classical communication rate $C$ to be achievable; i.e., we define an $(n, C - \delta, \epsilon)$
code for classical communication over a quantum channel. Alice begins by selecting some
classical message $m$ that she would like to transmit to Bob; she selects from a set of messages
$\{1, \ldots, |\mathcal{M}|\}$. Let $M$ denote the random variable corresponding to Alice's choice of message,
and let $|\mathcal{M}|$ denote its cardinality. She then prepares some state $\rho_m^{A'^n}$ as input to the many
independent uses of the channel; the input systems are $n$ copies of the channel input system
$A'$. She transmits this state over $n$ independent uses of the channel $\mathcal{N}$, and the state at
Bob's receiving end is

$$\mathcal{N}^{\otimes n}(\rho_m^{A'^n}). \tag{19.10}$$

Bob has some decoding POVM $\{\Lambda_m\}$ that he can exploit to determine which message Alice
transmits. Figure 19.3 depicts such a general protocol for classical communication over a
quantum channel.

Figure 19.3: The most general protocol for classical communication over a quantum channel. Alice selects
some message M and encodes it as a quantum codeword for input to many independent uses of the noisy
quantum channel. Bob performs some POVM over all of the channel outputs to determine the message that
Alice transmits.

Let $M'$ denote the random variable for Bob's estimate of the message. The probability
that he determines the correct message $m$ is as follows:

$$\Pr\{M' = m \,|\, M = m\} = \text{Tr}\{\Lambda_m \mathcal{N}^{\otimes n}(\rho_m^{A'^n})\}, \tag{19.11}$$

and thus the probability of error for a particular message $m$ is

\begin{align}
p_e(m) &\equiv 1 - \Pr\{M' = m \,|\, M = m\} \tag{19.12} \\
&= \text{Tr}\{(I - \Lambda_m) \mathcal{N}^{\otimes n}(\rho_m^{A'^n})\}. \tag{19.13}
\end{align}

The maximal probability of error for any coding scheme is then

$$p_e^* \equiv \max_{m \in \mathcal{M}} p_e(m). \tag{19.14}$$

The rate $C$ of communication is

$$C \equiv \frac{1}{n} \log_2 |\mathcal{M}| + \delta, \tag{19.15}$$

where $\delta$ is some arbitrarily small positive number, and the code has $\epsilon$ error if $p_e^* \le \epsilon$. A rate
$C$ of classical communication is achievable if there exists an $(n, C - \delta, \epsilon)$ code for all $\delta, \epsilon > 0$
and sufficiently large $n$.
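
The definitions in (19.12) through (19.15) amount to simple bookkeeping, which the following sketch makes explicit. The list of success probabilities is a made-up example rather than the output of any actual code, and the function names are hypothetical. The check accepts any code whose rate is at least $C - \delta$, which is an assumption about how one would test the criterion in practice.

```python
import math

def code_parameters(n, success_probs):
    """Return (rate, maximal error) of a code with n channel uses.

    success_probs[m] plays the role of Pr{M' = m | M = m} for message m.
    """
    num_messages = len(success_probs)
    rate = math.log2(num_messages) / n            # (1/n) log2 |M|
    max_error = max(1.0 - s for s in success_probs)
    return rate, max_error

def meets_criterion(n, success_probs, target_rate, delta, epsilon):
    """Check the (n, C - delta, epsilon) conditions for a target rate C."""
    rate, max_error = code_parameters(n, success_probs)
    return rate >= target_rate - delta and max_error <= epsilon

# Made-up example: four messages sent over n = 4 channel uses.
example_probs = [0.99, 0.98, 0.97, 0.99]
example_rate, example_error = code_parameters(4, example_probs)
print(example_rate, example_error)
```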

19.2.2 Common Randomness Generation 

A sender and receiver can exploit a quantum channel for the alternate but related task of
common randomness generation. Here, they only wish to generate uniform shared
randomness of the form:

$$\overline{\Phi}^{MM'} \equiv \frac{1}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} |m\rangle\langle m|^{M} \otimes |m\rangle\langle m|^{M'}. \tag{19.16}$$

Such shared randomness is not particularly useful as a resource, but this viewpoint is help- 
ful for proving the converse theorem of this chapter and later on when we encounter other 
information processing tasks in quantum Shannon theory. The main point to note is that a 
noiseless classical bit channel can always generate one bit of noiseless common randomness. 
Thus, if a quantum channel has a particular capacity for classical communication, it can 
always achieve the same capacity for common randomness generation. In fact, the capacity 
for common randomness generation can only be larger than that for classical communication 
because common randomness is a weaker resource than classical communication. This rela- 
tionship gives a simple way to bound the capacity for classical communication from above 
by the capacity for common randomness generation. 

The most general protocol for common randomness generation is as follows. Alice begins
by locally preparing a state of the form in (19.16). She then performs an encoding map that
transforms this state to the following one:

$$\frac{1}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} |m\rangle\langle m|^{M} \otimes \rho_m^{A'^n}, \tag{19.17}$$

and she transmits the $A'^n$ systems over the noisy quantum channel, producing the following
state:

$$\frac{1}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} |m\rangle\langle m|^{M} \otimes \mathcal{N}^{\otimes n}(\rho_m^{A'^n}). \tag{19.18}$$

Bob then performs a quantum instrument on the received systems (exploiting some POVM
$\{\Lambda_m\}$), and the resulting state is

$$\omega^{MB^nM'} \equiv \frac{1}{|\mathcal{M}|} \sum_{m=1}^{|\mathcal{M}|} \sum_{m'=1}^{|\mathcal{M}|} |m\rangle\langle m|^{M} \otimes \sqrt{\Lambda_{m'}} \mathcal{N}^{\otimes n}(\rho_m^{A'^n}) \sqrt{\Lambda_{m'}} \otimes |m'\rangle\langle m'|^{M'}. \tag{19.19}$$

The state $\omega^{MM'}$ should then be $\epsilon$-close in trace distance to the original state in (19.16) if the
protocol is good for common randomness generation:

$$\left\| \overline{\Phi}^{MM'} - \omega^{MM'} \right\|_1 \le \epsilon. \tag{19.20}$$

A rate $C$ for common randomness generation is achievable if there exists an $(n, C - \delta, \epsilon)$
common randomness generation code for all $\delta, \epsilon > 0$ and sufficiently large $n$.

19.3 The Classical Capacity Theorem 

We now state the main theorem of this chapter, the classical capacity theorem. 

Theorem 19.3.1 (Holevo-Schumacher-Westmoreland). The classical capacity of a quantum
channel is the supremum over all achievable rates, and one characterization of it is the
regularization of the Holevo information of the channel:

$$\sup\{C \,|\, C \text{ is achievable}\} = \chi_{\text{reg}}(\mathcal{N}), \tag{19.21}$$

where

$$\chi_{\text{reg}}(\mathcal{N}) \equiv \lim_{k \to \infty} \frac{1}{k} \chi(\mathcal{N}^{\otimes k}), \tag{19.22}$$

and the Holevo information $\chi(\mathcal{N})$ of a channel $\mathcal{N}$ is defined in (19.8).



The regularization in the above characterization is a reflection of our ignorance of a better
formula for the classical capacity of a quantum channel. The proof of the above theorem
in the next two sections demonstrates that the above quantity is indeed equal to the
classical capacity, but the regularization implies that the above characterization is intractable
for general quantum channels. Though, if the Holevo information of a particular channel is
additive (in the sense discussed in Chapter 12), then $\chi_{\text{reg}}(\mathcal{N}) = \chi(\mathcal{N})$, the classical capacity
formula simplifies for such a channel, and we can claim to have a complete understanding
of the channel's classical transmission capabilities. This "all-or-nothing" situation with
capacities is quite common in quantum Shannon theory, and it implies that we still have much
remaining to understand about classical information transmission over quantum channels.

The next two sections prove the above capacity theorem in two parts: the direct coding
theorem and the converse theorem. The proof of the direct coding theorem demonstrates the
inequality LHS $\ge$ RHS in (19.21). That is, it shows that the regularized Holevo information
is an achievable rate for classical communication, and it exploits typical and conditionally
typical subspaces and the Packing Lemma to do so. The proof of the converse theorem shows
the inequality LHS $\le$ RHS in (19.21). That is, it shows that any protocol with achievable
rate $C$ (with vanishing error in the large $n$ limit) should have its rate below the regularized
Holevo information. The proof of the converse theorem exploits the aforementioned idea of
common randomness generation, continuity of entropy, and the quantum data processing
inequality.

19.3.1 The Direct Coding Theorem 

We first prove the direct coding theorem. Suppose that a noisy channel $\mathcal{N}$ connects Alice to
Bob, and they are allowed access to many independent uses of this quantum channel. Alice
can choose some ensemble $\{p_X(x), \rho_x\}$ of states which she can exploit to make a random code
for this channel. She selects $|\mathcal{M}|$ codewords $\{x^n(m)\}_{m \in \{1, \ldots, |\mathcal{M}|\}}$ independently according to
the following distribution:

$$p'_{X'^n}(x^n) = \begin{cases} \left[ \sum_{x^n \in T_\delta^{X^n}} p_{X^n}(x^n) \right]^{-1} p_{X^n}(x^n) & : x^n \in T_\delta^{X^n} \\ 0 & : x^n \notin T_\delta^{X^n} \end{cases}, \tag{19.23}$$

where $X'^n$ is a random variable selected according to the distribution $p'_{X'^n}(x^n)$, $p_{X^n}(x^n) =
p_X(x_1) \cdots p_X(x_n)$, and $T_\delta^{X^n}$ denotes the set of strongly typical sequences for the distribution
$p_{X^n}(x^n)$ (see Section 13.7). This "pruned" distribution is close to the IID
distribution $p_{X^n}(x^n)$ because the probability mass of the typical set is nearly one (the next
exercise asks you to make this intuition precise).

Exercise 19.3.1 Prove that the trace distance between the pruned distribution $p'_{X'^n}(x^n)$
and the IID distribution $p_{X^n}(x^n)$ is small for all sufficiently large $n$:

$$\sum_{x^n \in \mathcal{X}^n} \left| p'_{X'^n}(x^n) - p_{X^n}(x^n) \right| \le 2\epsilon, \tag{19.24}$$

where $\epsilon$ is an arbitrarily small positive number such that $\Pr\{X^n \in T_\delta^{X^n}\} \ge 1 - \epsilon$.
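
Before attempting the exercise, it can help to check the statement numerically on a toy source. The following sketch enumerates all binary sequences of a small length, forms the pruned distribution of (19.23), and compares it with the IID distribution. The binary alphabet, the parameter values, and the simplified typicality test (the empirical frequency of ones lies within $\delta$ of its probability) are assumptions for illustration, and the function names are hypothetical.

```python
import itertools

def iid_prob(seq, p_one):
    """Probability of a binary sequence under the IID distribution."""
    ones = sum(seq)
    return p_one ** ones * (1 - p_one) ** (len(seq) - ones)

def is_typical(seq, p_one, delta):
    """Simplified typicality test on the empirical frequency of ones."""
    return abs(sum(seq) / len(seq) - p_one) <= delta

def pruned_distance(n, p_one, delta):
    """Return (typical-set mass, sum over sequences of |p' - p|)."""
    sequences = list(itertools.product([0, 1], repeat=n))
    typical_mass = sum(iid_prob(s, p_one) for s in sequences
                       if is_typical(s, p_one, delta))
    distance = 0.0
    for s in sequences:
        p = iid_prob(s, p_one)
        # Pruned distribution: renormalize on the typical set, zero elsewhere.
        p_pruned = p / typical_mass if is_typical(s, p_one, delta) else 0.0
        distance += abs(p_pruned - p)
    return typical_mass, distance

mass, dist = pruned_distance(n=12, p_one=0.25, delta=0.2)
print(mass, dist, 2 * (1 - mass))
```

In this toy setting the computed distance coincides with twice the probability mass outside the typical set, which is the quantity that the bound $2\epsilon$ in (19.24) controls.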

These classical codewords $\{x^n(m)\}_{m \in \{1, \ldots, |\mathcal{M}|\}}$ lead to quantum codewords of the following
form:

$$\rho_{x^n(m)} \equiv \rho_{x_1(m)} \otimes \cdots \otimes \rho_{x_n(m)}, \tag{19.25}$$

by exploiting the quantum states in the ensemble $\{p_X(x), \rho_x\}$. Alice then transmits these
codewords through the channel, leading to the following tensor product density operators:

\begin{align}
\sigma_{x^n(m)} &\equiv \sigma_{x_1(m)} \otimes \cdots \otimes \sigma_{x_n(m)} \tag{19.26} \\
&= \mathcal{N}(\rho_{x_1(m)}) \otimes \cdots \otimes \mathcal{N}(\rho_{x_n(m)}). \tag{19.27}
\end{align}

Bob then detects which codeword Alice transmits by exploiting some detection POVM $\{\Lambda_m\}$
that acts on all of the channel outputs.



At this point, we would like to exploit the Packing Lemma (Lemma 15.3.1 from
Chapter 15). Recall that four objects are needed to apply the Packing Lemma, and they should
satisfy four inequalities. The first object needed is an ensemble from which we can select
a code randomly, and the ensemble in our case is $\{p'_{X'^n}(x^n), \sigma_{x^n}\}$. The next object is the
expected density operator of this ensemble:

$$\mathbb{E}_{X'^n}\{\sigma_{X'^n}\} = \sum_{x^n \in \mathcal{X}^n} p'_{X'^n}(x^n) \sigma_{x^n}. \tag{19.28}$$

Finally, we need a message subspace projector and a total subspace projector, and we let
these respectively be the conditionally typical projector $\Pi_\delta^{B^n|x^n}$ for the state $\sigma_{x^n}$ and the
typical projector $\Pi_\delta^{B^n}$ for the tensor product state $\sigma^{\otimes n}$, where $\sigma \equiv \sum_x p_X(x) \sigma_x$. Intuitively,
the tensor product state $\sigma^{\otimes n}$ should be close to the expected state $\mathbb{E}_{X'^n}\{\sigma_{X'^n}\}$, and the next
exercise asks you to verify this statement.

Exercise 19.3.2 Prove that the trace distance between the expected state $\mathbb{E}_{X'^n}\{\sigma_{X'^n}\}$ and
the tensor product state $\sigma^{\otimes n}$ is small for all sufficiently large $n$:

$$\left\| \mathbb{E}_{X'^n}\{\sigma_{X'^n}\} - \sigma^{\otimes n} \right\|_1 \le 2\epsilon, \tag{19.29}$$

where $\epsilon$ is an arbitrarily small positive number such that $\Pr\{X^n \in T_\delta^{X^n}\} \ge 1 - \epsilon$.



If the four conditions of the Packing Lemma are satisfied (see (15.11)-(15.14)), then there
exists a coding scheme with a detection POVM that has an arbitrarily low maximal
probability of error as long as the number of messages in the code is not too high. We now show
how to satisfy these four conditions by exploiting the properties of typical and conditionally
typical projectors. The following three conditions follow from the properties of typical
subspaces:

\begin{align}
\text{Tr}\{\Pi_\delta^{B^n} \sigma_{x^n}\} &\ge 1 - \epsilon, \tag{19.30} \\
\text{Tr}\{\Pi_\delta^{B^n|x^n} \sigma_{x^n}\} &\ge 1 - \epsilon, \tag{19.31} \\
\text{Tr}\{\Pi_\delta^{B^n|x^n}\} &\le 2^{n(H(B|X) + \delta)}. \tag{19.32}
\end{align}

The first inequality follows from Property 14.2.7. The second inequality follows from
Property 14.2.4, and the third from Property 14.2.5. We leave the proof of the fourth inequality
for the Packing Lemma as an exercise.

Exercise 19.3.3 Prove that the following inequality holds:

$$\Pi_\delta^{B^n} \, \mathbb{E}_{X'^n}\{\sigma_{X'^n}\} \, \Pi_\delta^{B^n} \le [1 - \epsilon]^{-1} 2^{-n(H(B) - \delta)} \Pi_\delta^{B^n}. \tag{19.33}$$

(Hint: First show that $\mathbb{E}_{X'^n}\{\sigma_{X'^n}\} \le [1 - \epsilon]^{-1} \sigma^{\otimes n}$ and then apply the third property of
typical subspaces, Property 14.1.3.)



With these four conditions holding, it follows from Corollary 15.5.1 (the derandomized
version of the Packing Lemma) that there exists a deterministic code and a POVM $\{\Lambda_m\}$
that can detect the transmitted states with arbitrarily low maximal probability of error as
long as the size $|\mathcal{M}|$ of the message set is small enough:

\begin{align}
p_e^* &\equiv \max_m \text{Tr}\{(I - \Lambda_m) \mathcal{N}^{\otimes n}(\rho_{x^n(m)})\} \tag{19.34} \\
&\le 4(\epsilon + 2\sqrt{\epsilon}) + 8 [1 - \epsilon]^{-1} 2^{-n(H(B) - H(B|X) - 2\delta)} |\mathcal{M}| \tag{19.35} \\
&= 4(\epsilon + 2\sqrt{\epsilon}) + 8 [1 - \epsilon]^{-1} 2^{-n(I(X;B) - 2\delta)} |\mathcal{M}|. \tag{19.36}
\end{align}

So, we can choose the size of the message set to be $|\mathcal{M}| = 2^{n(I(X;B) - 3\delta)}$ so that the rate of
communication is the Holevo information $I(X;B)$:

$$\frac{1}{n} \log_2 |\mathcal{M}| = I(X;B) - 3\delta, \tag{19.37}$$

and the bound on the maximal probability of error becomes

$$p_e^* \le 4(\epsilon + 2\sqrt{\epsilon}) + 8 [1 - \epsilon]^{-1} 2^{-n\delta}. \tag{19.38}$$

Since $\epsilon$ is an arbitrary positive number that approaches zero for sufficiently large $n$ and
$\delta$ is an arbitrarily small positive constant, the maximal probability of error vanishes as
$n$ becomes large. Thus, the Holevo information $I(X;B)_\rho$, with respect to the following
classical-quantum state

$$\rho^{XB} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \mathcal{N}(\rho_x), \tag{19.39}$$

is an achievable rate for the transmission of classical information over $\mathcal{N}$.
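
To get a feeling for how the two terms in (19.38) behave, the following sketch simply evaluates the right-hand side for some sample parameters. The numerical values of $n$, $\epsilon$, and $\delta$ are arbitrary assumptions, and the function name is hypothetical. In the actual proof $\epsilon$ itself shrinks with $n$, whereas here it is held fixed for simplicity.

```python
import math

def hsw_error_bound(n, epsilon, delta):
    """Right-hand side of (19.38) for fixed epsilon and delta."""
    typicality_term = 4 * (epsilon + 2 * math.sqrt(epsilon))
    packing_term = 8 / (1 - epsilon) * 2 ** (-n * delta)
    return typicality_term + packing_term

# Arbitrary sample parameters: the packing term decays exponentially in n,
# while the typicality term depends only on epsilon.
for n in (100, 1000, 10000):
    print(n, hsw_error_bound(n, epsilon=1e-6, delta=0.05))
```

For small $n$ the bound can exceed one and is then vacuous; it becomes informative only once $n\delta$ is large enough for the second term to be small.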

Alice and Bob can achieve the Holevo information $\chi(\mathcal{N})$ of the channel $\mathcal{N}$ simply by
selecting a random code according to the ensemble $\{p_X(x), \rho_x\}$ that maximizes $I(X;B)_\rho$.
Lastly, they can achieve the rate $\frac{1}{k} \chi(\mathcal{N}^{\otimes k})$ by coding instead for the tensor product channel
$\mathcal{N}^{\otimes k}$, and this last result implies that they can achieve the regularization $\chi_{\text{reg}}(\mathcal{N})$ by making
the blocks for which they are coding be arbitrarily large.

We comment more on the role of entanglement at the encoder before moving on to the
proof of the converse theorem. First, the above coding scheme for the channel $\mathcal{N}$ does not
make use of entangled inputs at the encoder because the codeword states $\rho_{x^n(m)}$ are separable
across the channel inputs. It is only when we code for the tensor product channel $\mathcal{N}^{\otimes k}$ that
entanglement comes into play. Here, the codeword states are of the form:

$$\rho_{x^n(m)} \equiv \rho_{x_1(m)}^{A'^k} \otimes \cdots \otimes \rho_{x_n(m)}^{A'^k}. \tag{19.40}$$

That is, the states $\rho_{x_i(m)}^{A'^k}$ exist on the Hilbert space of $k$ channel inputs and can be
entangled across these $k$ systems. Whether entanglement at the encoder could increase classical
communication rates over general quantum channels (whether the regularization in (19.21)
is really necessary for the general case) was the subject of much intense work over the past
few years, but a recent result has demonstrated the existence of a channel for which
exploiting entanglement at the encoder is strictly better than not exploiting entanglement (see
Section 19.5).



It is worth re-examining the proof of the Packing Lemma (Lemma 15.3.1) in order to
understand better the decoding POVM at the receiving end. The particular decoding POVM
elements employed in the Packing Lemma have the following form:

\begin{align}
\Lambda_m &\equiv \left( \sum_{m'=1}^{|\mathcal{M}|} \Gamma_{m'} \right)^{-1/2} \Gamma_m \left( \sum_{m'=1}^{|\mathcal{M}|} \Gamma_{m'} \right)^{-1/2}, \tag{19.41} \\
\Gamma_m &\equiv \Pi_\delta^{B^n} \, \Pi_\delta^{B^n|x^n(m)} \, \Pi_\delta^{B^n}. \tag{19.42}
\end{align}

(Simply substitute the conditionally typical projector $\Pi_\delta^{B^n|x^n(m)}$ and the typical projector $\Pi_\delta^{B^n}$
into (15.24).) A POVM with the above elements is known as a "square-root" measurement
because of its particular form. We employ such a measurement at the decoder because it
has nice analytic properties that allow us to obtain a good bound on the expectation of
the average error probability (in particular, we can exploit the operator inequality from
Exercise 15.4.1). This measurement is a collective measurement because the conditionally
typical projector and the typical projector are both acting on all of the channel outputs, and
we construct the square-root measurement from these projectors. Such a decoding POVM is
far more exotic than the naive strategy overviewed in Section 19.1, where Bob measures the
channel outputs individually. It is for the construction of this decoding POVM and the proof
that it is asymptotically good that Holevo, Schumacher, and Westmoreland were given much
praise for their work. Though, there is no known way to implement this decoding POVM
efficiently, and the original efficiency problems with the decoder in the proof of Shannon's
noisy classical channel coding theorem plague the decoders in the quantum world as well.

Exercise 19.3.4 Prove the direct coding theorem of HSW without applying the Packing 
Lemma (but you can use similar steps as in the Packing Lemma). 

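Exercise 19.3.5 Show that a measurement with POVM elements of the following form is
sufficient to achieve the Holevo information of a quantum channel:

$$\Lambda_m \equiv \left( \sum_{m'=1}^{|\mathcal{M}|} \Pi_\delta^{B^n|x^n(m')} \right)^{-1/2} \Pi_\delta^{B^n|x^n(m)} \left( \sum_{m'=1}^{|\mathcal{M}|} \Pi_\delta^{B^n|x^n(m')} \right)^{-1/2}. \tag{19.43}$$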



19.3.2 The Converse Theorem 

The second part of the classical capacity theorem is the converse theorem, and we provide
a simple proof of it in this section. Suppose that Alice and Bob are trying to accomplish
common randomness generation rather than classical communication. The capacity for such
a task can only be larger than that for classical communication, as we argued before in
Section 19.2. Recall that in such a task, Alice first prepares a maximally correlated state
$\overline{\Phi}^{MM'}$ so that the rate $C - \delta$ of common randomness generation is equal to $\frac{1}{n} \log_2 |\mathcal{M}|$. Alice
and Bob share a state of the form in (19.19) after encoding, channel transmission, and
decoding. We now show that the regularized Holevo information in (19.22) bounds the rate
of common randomness generation for any protocol that has vanishing error in the asymptotic
limit (the error criterion is in (19.20)). As a result, the regularized Holevo information also
upper bounds the capacity for classical communication. Consider the following chain of
inequalities:

\begin{align}
n(C - \delta) &= I(M;M')_{\overline{\Phi}} \tag{19.44} \\
&\le I(M;M')_{\omega} + n\epsilon' \tag{19.45} \\
&\le I(M;B^n)_{\omega} + n\epsilon' \tag{19.46} \\
&\le \chi(\mathcal{N}^{\otimes n}) + n\epsilon'. \tag{19.47}
\end{align}

The first equality follows because the mutual information of the common randomness state
$\overline{\Phi}^{MM'}$ is equal to $n(C - \delta)$ bits. The first inequality follows from the error criterion in
(19.20) and by applying the Alicki-Fannes' inequality for quantum mutual information
(Exercise 11.9.7) with $\epsilon' \equiv 6\epsilon C + 4H_2(\epsilon)/n$. The second inequality results from the quantum
data processing inequality for quantum mutual information (Corollary 11.9.4); recall that
Bob processes the $B^n$ system with a quantum instrument to get the classical system $M'$.
Also, the quantum mutual information is evaluated on a classical-quantum state of the form
in (19.18). The final inequality follows because the classical-quantum state in (19.18) has a
particular distribution and choice of states, and this choice always leads to a value of the
quantum mutual information that cannot be greater than the Holevo information of the
tensor product channel $\mathcal{N}^{\otimes n}$.
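
Dividing (19.47) by $n$ gives $C - \delta \le \frac{1}{n} \chi(\mathcal{N}^{\otimes n}) + \epsilon'$, so the converse is only as tight as the slack term $\epsilon' = 6\epsilon C + 4H_2(\epsilon)/n$. The following sketch evaluates this slack for a few sample values to show that it vanishes as $\epsilon$ becomes small. The sample rate, block length, and error values are arbitrary assumptions, and the function names are hypothetical.

```python
import math

def binary_entropy(e):
    """Binary entropy H_2(e) in bits, with H_2(0) = H_2(1) = 0."""
    if e <= 0.0 or e >= 1.0:
        return 0.0
    return -e * math.log2(e) - (1 - e) * math.log2(1 - e)

def converse_slack(epsilon, rate, n):
    """Slack term epsilon' = 6 * epsilon * C + 4 * H_2(epsilon) / n."""
    return 6 * epsilon * rate + 4 * binary_entropy(epsilon) / n

# Arbitrary sample values: rate C = 1 bit per channel use, n = 100 uses.
for eps in (1e-1, 1e-2, 1e-3):
    print(eps, converse_slack(eps, rate=1.0, n=100))
```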



19.4 Examples of Channels 



Observe that the final upper bound in (19.47) on the rate $C$ is the multi-letter Holevo
information of the channel. It would be more desirable to have $\chi(\mathcal{N})$ as the upper bound
on $C$ rather than $\frac{1}{n} \chi(\mathcal{N}^{\otimes n})$ because the former is simpler, whereas the optimization problem
set out in the latter quantity is simply impossible to compute with finite computational
resources. Though, the upper bound in (19.47) is the best known upper bound if we do not
know anything else about the structure of the channel, and for this reason, the best known
characterization of the classical capacity is the one given in (19.21).

If we know that the Holevo information of the tensor product of a certain channel with
itself is additive, then there is no need for the regularization $\chi_{\text{reg}}(\mathcal{N})$, and the
characterization in Theorem 19.3.1 reduces to a very good one: the Holevo information $\chi(\mathcal{N})$. There are
many examples of channels for which the classical capacity reduces to the Holevo information
of the channel, and we detail three such classes of examples in this section: the cq channels,
the quantum Hadamard channels, and the quantum depolarizing channels. The proof that
demonstrates additivity of the Holevo information for each of these channels depends
explicitly on structural properties of each one, and there is unfortunately not much to learn from
these proofs in order to say anything about additivity of the Holevo information of general
quantum channels. Nevertheless, it is good to have some natural channels for which we can
compute the classical capacity, and it is instructive to examine these proofs in detail to
understand what it is about each channel that makes its Holevo information additive.



19.4.1 Classical Capacity of Classical-Quantum Channels 



Recall from Section 4.4.6 that a quantum channel is a particular kind of entanglement-breaking
channel (cq channel) if the action of the channel is equivalent to performing first
a complete von Neumann measurement of the input and then preparing a quantum state
conditional on the value of the classical variable resulting from the measurement. We have
already seen in Section 12.3.1 that the Holevo information of these channels is additive.
Additionally, Theorem 12.3.3 states that the Holevo information is a concave function of
the input distribution over which we are optimizing for such channels. Thus, computing
the classical capacity of a cq channel can be performed by optimization techniques
because its Holevo information is additive.

The Relation to General Channels 

We can always exploit the above result regarding cq entanglement-breaking channels to get a
reasonable lower bound on the classical capacity of any quantum channel $\mathcal{N}$. The sender Alice
can simulate an entanglement-breaking channel by modifying the processing at the input of
an arbitrary quantum channel. She can first measure the input to her simulated channel
in the basis $\{|x\rangle\langle x|\}$, prepare a state $\rho_x$ conditional on the outcome of the measurement,
and subsequently feed this state into the channel $\mathcal{N}$. These actions are equivalent to the
following map:

$$\sigma \to \sum_x \langle x|\sigma|x\rangle \, \mathcal{N}(\rho_x), \tag{19.48}$$

and the capacity of this simulated channel is equal to

$$I(X;B)_\rho, \tag{19.49}$$

where

\begin{align}
\rho^{XB} &\equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \mathcal{N}(\rho_x), \tag{19.50} \\
p_X(x) &\equiv \langle x|\sigma|x\rangle. \tag{19.51}
\end{align}

Of course, Alice has the freedom to prepare whichever state $\sigma$ she would like to be input to the
simulated channel, and she also has the ability to prepare whichever states $\rho_x$ she would like
to be conditional on the outcomes of the first measurement, so we should let her maximize the
Holevo information over all these inputs. Thus, the capacity of the entanglement-breaking
channel composed with the actual channel is equivalent to the Holevo information of the
original channel:

$$\max_{p_X(x), \rho_x} I(X;B)_\rho. \tag{19.52}$$

This capacity is also known as the product-state capacity of the channel because it is the
capacity achieved by inputting unentangled, separable states at the encoder (Alice can in fact
just input product states), and it can be a good lower bound on the true classical capacity
of a quantum channel, even if it does not allow for entanglement at the encoder.

19.4.2 Classical Capacity of Quantum Hadamard Channels 



Recall from Section 5.2.4 that quantum Hadamard channels are those with a complementary
channel that is entanglement-breaking, and this property allows us to prove that the
Holevo information of the original channel is additive. Several important natural channels
are quantum Hadamard channels. A trivial example is the noiseless qubit channel because
Bob could perform a von Neumann measurement of his system and send a constant state
to Eve. A less trivial example of a quantum Hadamard channel is a generalized dephasing
channel (see Section 5.2.3), though this channel trivially has a maximal classical capacity
of $\log_2 d$ bits per channel use because this channel transmits a preferred orthonormal basis
without error. A quantum Hadamard channel with a more interesting classical capacity is
known as a cloning channel, the channel induced by a universal cloning machine (though we
will not discuss this channel in any detail).

Theorem 19.4.1. The Holevo information of a quantum Hadamard channel $\mathcal{N}_H$ and any
other channel $\mathcal{N}$ is additive:

$$\chi(\mathcal{N}_H \otimes \mathcal{N}) = \chi(\mathcal{N}_H) + \chi(\mathcal{N}). \tag{19.53}$$



Proof. First, recall from Theorem 12.3.2 that it is sufficient to consider ensembles of pure
states at the input of the channel when maximizing its Holevo information. That is, we only
need to consider classical-quantum states of the following form:

$$\sigma^{XA'} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes |\phi_x\rangle\langle\phi_x|^{A'}, \tag{19.54}$$

where $A'$ is the input to some channel $\mathcal{N}^{A' \to B}$. Let $\omega^{XBE} \equiv U_{\mathcal{N}}^{A' \to BE}(\sigma^{XA'})$ where $U_{\mathcal{N}}^{A' \to BE}$ is
an isometric extension of the channel. Thus, the Holevo information of $\mathcal{N}^{A' \to B}$ is equivalent
to a different expression:

\begin{align}
\chi(\mathcal{N}) &\equiv \max_{\sigma} I(X;B)_\omega \tag{19.55} \\
&= \max_{\sigma} \left[ H(B)_\omega - H(B|X)_\omega \right] \tag{19.56} \\
&= \max_{\sigma} \left[ H(B)_\omega - H(E|X)_\omega \right], \tag{19.57}
\end{align}

where the second equality follows from the definition of the quantum mutual information,
and the third equality follows because, conditional on $X$, the input to the channel is pure
and the entropies $H(B|X)_\omega$ and $H(E|X)_\omega$ are equal.

Exercise 19.4.1 Prove that it is sufficient to consider pure state inputs when maximizing
the following entropy difference over classical-quantum states:

$$\max_{\sigma} \left[ H(B)_\omega - H(E|X)_\omega \right]. \tag{19.58}$$

Suppose now that $\sigma$ is a state that maximizes the Holevo information of the joint channel
$\mathcal{N}_H \otimes \mathcal{N}$, and suppose it has the following form:

$$\sigma^{XA_1'A_2'} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes |\phi_x\rangle\langle\phi_x|^{A_1'A_2'}. \tag{19.59}$$

Let

$$\omega^{XB_1B_2E_1E_2} \equiv (U_{\mathcal{N}_H}^{A_1' \to B_1E_1} \otimes U_{\mathcal{N}}^{A_2' \to B_2E_2})(\sigma^{XA_1'A_2'}). \tag{19.60}$$

Figure 19.4: A summary of the structural relationships for the additivity question if one channel is a
quantum Hadamard channel. Alice first prepares a state of the form in (19.59). She transmits one system
$A_1'$ through the quantum Hadamard channel and the other $A_2'$ through the other channel. The first Bob
$B_1$ at the output of the Hadamard channel can simulate the channel to the first Eve $E_1$ because the first
channel is a quantum Hadamard channel. He performs a von Neumann measurement of his system, leading
to a classical variable $Y$, followed by the preparation of some state conditional on the value of the classical
variable $Y$. The bottom of the figure labels the state of the systems at each step.



The Hadamard channel is degradable, and the degrading map from Bob to Eve takes a
particular form: it is a von Neumann measurement that produces a classical variable $Y$,
followed by the preparation of a state conditional on the outcome of the measurement. Let
$\mathcal{D}_1^{B_1 \to Y}$ be the first part of the degrading map that produces the classical variable $Y$, and let
$\theta^{XYE_1B_2E_2} \equiv \mathcal{D}_1^{B_1 \to Y}(\omega^{XB_1B_2E_1E_2})$. Let $\mathcal{D}_2^{Y \to E_1'}$ be the second part of the degrading channel
that produces the state of $E_1'$ conditional on the classical variable $Y$, and let $\tau^{XE_1'E_1B_2E_2} \equiv
\mathcal{D}_2^{Y \to E_1'}(\theta^{XYE_1B_2E_2})$. Figure 19.4 summarizes these structural relationships. Consider the
following chain of inequalities:

\begin{align}
I(X;B_1B_2)_\omega &= H(B_1B_2)_\omega - H(B_1B_2|X)_\omega \tag{19.61} \\
&= H(B_1B_2)_\omega - H(E_1E_2|X)_\omega \tag{19.62} \\
&\le H(B_1)_\omega + H(B_2)_\omega - H(E_1|X)_\omega - H(E_2|E_1X)_\omega \tag{19.63} \\
&= H(B_1)_\omega - H(E_1|X)_\omega + H(B_2)_\omega - H(E_2|E_1'X)_\tau \tag{19.64} \\
&\le H(B_1)_\omega - H(E_1|X)_\omega + H(B_2)_\theta - H(E_2|YX)_\theta \tag{19.65} \\
&\le \chi(\mathcal{N}_H) + \chi(\mathcal{N}). \tag{19.66}
\end{align}

The first equality follows from the definition of the quantum mutual information. The
second equality follows because $H(B_1B_2|X)_\omega = H(E_1E_2|X)_\omega$ when the conditional
inputs $|\phi_x\rangle^{A_1'A_2'}$ to the channel are pure states. The next inequality follows from
subadditivity of entropy $H(B_1B_2)_\omega \le H(B_1)_\omega + H(B_2)_\omega$ and from the chain rule for entropy:
$H(E_1E_2|X)_\omega = H(E_1|X)_\omega + H(E_2|E_1X)_\omega$. The third equality follows from a rearrangement
of terms and realizing that the state of $\tau$ on systems $E_1'E_2X$ is equivalent to the state of
$\omega$ on systems $E_1E_2X$. The second inequality follows from the quantum data processing
inequality $I(E_2;E_1'|X)_\tau \le I(E_2;Y|X)_\theta$ and because the state of $\theta$ on system $B_2$ is
the same as that of $\omega$. The final inequality follows because the state $\omega$
is a state of the form in (19.57), because the entropy difference is never greater than the
Holevo information of the first channel, and from the result of Exercise 19.4.1. The same
reasoning follows for the other entropy difference and by noting that the classical system is
the composite system $XY$. □

19.4.3 Classical Capacity of the Depolarizing Channel 

The qudit depolarizing channel is another example of a channel for which we can compute
its classical capacity. Additionally, we will see that achieving the classical capacity of this
channel requires a strategy which is very "classical": it is sufficient to prepare classical states
$\{|x\rangle\langle x|\}$ at the input of the channel and to measure each channel output in the same basis
(see Exercise 19.4.3). However, we will later see in Chapter 23 that the depolarizing channel
has some rather bizarre, uniquely quantum features when considering its quantum capacity,
even though the features of its classical capacity are rather classical.

Recall from Section 4.4.6 that the depolarizing channel is the following map:

$$\mathcal{N}_D(\rho) = (1-p)\rho + p\pi, \tag{19.67}$$

where $\pi$ is the maximally mixed state.

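Theorem 19.4.2 (Classical Capacity of the Depolarizing Channel). The classical capacity
of the qudit depolarizing channel $\mathcal{N}_D$ is as follows:

$$\chi(\mathcal{N}_D) = \log_2 d + \left(1 - p + \frac{p}{d}\right)\log_2\left(1 - p + \frac{p}{d}\right) + (d-1)\frac{p}{d}\log_2\left(\frac{p}{d}\right). \tag{19.68}$$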
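
As a quick numerical companion to (19.68), the following minimal sketch evaluates the closed-form expression. The function name `depolarizing_classical_capacity` and the restriction of $p$ to the interval $[0, 1]$ are assumptions made for illustration; they are not part of the text.

```python
import math


def depolarizing_classical_capacity(p: float, d: int) -> float:
    """Evaluate the right-hand side of (19.68), in bits per channel use.

    Assumption for this sketch: 0 <= p <= 1 and d >= 2.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("this sketch assumes 0 <= p <= 1")
    if d < 2:
        raise ValueError("dimension d must be at least 2")

    def xlog2x(x: float) -> float:
        # Convention 0 * log2(0) = 0, as usual for entropies.
        return x * math.log2(x) if x > 0 else 0.0

    large = 1.0 - p + p / d   # eigenvalue with multiplicity one
    small = p / d             # eigenvalue with multiplicity d - 1
    return math.log2(d) + xlog2x(large) + (d - 1) * xlog2x(small)
```

Evaluating the function at $p = 0$ returns $\log_2 d$ (no noise), and at $p = 1$ it returns zero up to floating-point rounding (the output is always maximally mixed).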

Proof. The first part of the proof of this theorem relies on a somewhat technical result,
namely, that the Holevo information of the tensor product channel $\mathcal{N}_D \otimes \mathcal{N}$ is additive
(where the first channel is the depolarizing channel and the other is arbitrary):

$$\chi(\mathcal{N}_D \otimes \mathcal{N}) = \chi(\mathcal{N}_D) + \chi(\mathcal{N}). \tag{19.69}$$

This result is due to King [171], and it exploits a few properties of the depolarizing channel.
The result implies that the classical capacity of the depolarizing channel is equal to its Holevo
information. We now show how to compute the Holevo information of the depolarizing
channel. To do so, we first determine the minimum output entropy of the channel.

Definition 19.4.1 (Minimum Output Entropy). The minimum output entropy $H_{\min}(\mathcal{N})$ of
a channel $\mathcal{N}$ is the minimum of the entropy at the output of the channel:

$$H_{\min}(\mathcal{N}) \equiv \min_{\rho} H(\mathcal{N}(\rho)), \tag{19.70}$$

where the minimization is over all states input to the channel.

Exercise 19.4.2 Prove that it is sufficient to minimize over only pure input states to
the channel when computing the minimum output entropy. That is,

$$H_{\min}(\mathcal{N}) = \min_{|\psi\rangle} H(\mathcal{N}(|\psi\rangle\langle\psi|)). \tag{19.71}$$

The depolarizing channel is a highly symmetric channel. For example, if we input a pure
state $|\psi\rangle$ to the channel, the output is as follows:

\begin{align}
(1-p)\psi + p\pi &= (1-p)\psi + \frac{p}{d} I \tag{19.72} \\
&= (1-p)\psi + \frac{p}{d}(\psi + I - \psi) \tag{19.73} \\
&= \left(1 - p + \frac{p}{d}\right)\psi + \frac{p}{d}(I - \psi). \tag{19.74}
\end{align}

Observe that the eigenvalues of the output state are the same for any pure state and are
equal to $1 - p + \frac{p}{d}$ with multiplicity one and $\frac{p}{d}$ with multiplicity $d - 1$. Thus, the minimum
output entropy of the depolarizing channel is just

$$H_{\min}(\mathcal{N}_D) = -\left(1 - p + \frac{p}{d}\right)\log_2\left(1 - p + \frac{p}{d}\right) - (d-1)\frac{p}{d}\log_2\left(\frac{p}{d}\right). \tag{19.75}$$

We now compute the Holevo information of the depolarizing channel. Recall from
Theorem 12.3.2 that it is sufficient to consider optimizing the Holevo information over a classical-
quantum state with conditional states that are pure (a state $\sigma^{XA'}$ of the form in (19.54)).
Also, the Holevo information has the following form:

$$\max_{\sigma} I(X;B)_\omega = \max_{\sigma}\left[H(B)_\omega - H(B|X)_\omega\right], \tag{19.76}$$

where $\omega^{XB}$ is the output state. Consider the following augmented input ensemble:

$$\rho^{XIJA'} \equiv \frac{1}{d^2}\sum_{x}\sum_{i,j=0}^{d-1} p_X(x)|x\rangle\langle x|^{X} \otimes |i\rangle\langle i|^{I} \otimes |j\rangle\langle j|^{J} \otimes X(i)Z(j)\,\psi_x^{A'}\,Z^\dagger(j)X^\dagger(i), \tag{19.77}$$

where $X(i)$ and $Z(j)$ are the generalized Pauli operators from Section 3.6.2. Suppose that
we trace over the $IJ$ system. Then the state $\rho^{XA'}$ is as follows:

$$\rho^{XA'} = \sum_{x} p_X(x)|x\rangle\langle x|^{X} \otimes \pi^{A'}, \tag{19.78}$$

by recalling the result of Exercise 4.4.9. Also, note that inputting the maximally mixed state
to the depolarizing channel results in the maximally mixed state at its output. Consider the
following chain of inequalities:

\begin{align}
I(X;B)_\omega &= H(B)_\omega - H(B|X)_\omega \tag{19.79} \\
&\leq H(B)_\rho - H(B|X)_\omega \tag{19.80} \\
&= \log_2 d - H(B|XIJ)_\rho \tag{19.81} \\
&= \log_2 d - \sum_x p_X(x) H(B)_{\mathcal{N}_D(\psi_x)} \tag{19.82} \\
&\leq \log_2 d - \min_x H(B)_{\mathcal{N}_D(\psi_x)} \tag{19.83} \\
&\leq \log_2 d - H_{\min}(\mathcal{N}_D). \tag{19.84}
\end{align}

The first equality follows by expanding the quantum mutual information. The first inequality
follows from concavity of entropy. The second equality follows because the state of $\rho$ on
system $B$ is the maximally mixed state $\pi$ and from the following chain of equalities:

\begin{align}
H(B|XIJ)_\rho &= \frac{1}{d^2}\sum_x\sum_{i,j=0}^{d-1} p_X(x) H(B)_{\mathcal{N}_D(X(i)Z(j)\psi_x Z^\dagger(j)X^\dagger(i))} \tag{19.85} \\
&= \frac{1}{d^2}\sum_x\sum_{i,j=0}^{d-1} p_X(x) H(B)_{X(i)Z(j)\mathcal{N}_D(\psi_x)Z^\dagger(j)X^\dagger(i)} \tag{19.86} \\
&= \sum_x p_X(x) H(B)_{\mathcal{N}_D(\psi_x)} \tag{19.87} \\
&= H(B|X)_\omega. \tag{19.88}
\end{align}

The third equality in (19.82) follows from the above chain of equalities. The second
inequality in (19.83) follows because the expectation is never less than the minimum (this
step is not strictly necessary for the depolarizing channel). The last inequality follows
because $\min_x H(B)_{\mathcal{N}_D(\psi_x)} \geq H_{\min}(\mathcal{N}_D)$ (though it is actually an equality for the depolarizing
channel). An ensemble of the following form suffices to achieve the classical capacity of the
depolarizing channel:

$$\frac{1}{d}\sum_{i=0}^{d-1} |i\rangle\langle i|^{I} \otimes |i\rangle\langle i|^{A'}, \tag{19.89}$$

because we only require that the reduced state on $A'$ be equivalent to the maximally mixed
state. The final expression for the classical capacity of the depolarizing channel is as stated
in Theorem 19.4.2, which we plot in Figure 19.5 as a function of the dimension $d$ and the
depolarizing parameter $p$. □

Figure 19.5: The classical capacity of the quantum depolarizing channel as a function of the dimension $d$
of the channel and the depolarizing parameter $p$. The classical capacity vanishes when $p = 1$ because the
channel replaces the input with the maximally mixed state. The classical capacity is maximal at $\log_2 d$ when
$p = 0$ because there is no noise. In between these two extremes, the classical capacity is a smooth function
of $p$ and $d$ given by the expression in (19.68).

Exercise 19.4.3 (Achieving the classical capacity of the depolarizing channel) We
actually know that even more is true regarding the method for achieving the classical capacity
of the depolarizing channel. Prove that it is possible to achieve the classical capacity of the
depolarizing channel by choosing states from an ensemble $\{\frac{1}{d}, |x\rangle\langle x|\}$ and performing a von
Neumann measurement in the same basis at the output of each channel. That is, the naive
scheme outlined in Section 19.1 is sufficient to attain the classical capacity of the depolarizing
channel. (Hint: First show that the classical channel $p_{Y|X}(y|x)$ induced by inputting a state
$|x\rangle$ to the depolarizing channel and measuring $|y\rangle$ at the output is as follows:

$$p_{Y|X}(y|x) = (1-p)\delta_{x,y} + \frac{p}{d}. \tag{19.90}$$

Then show that the distribution $p_Y(y)$ is uniform if $p_X(x)$ is uniform. Finally, show that

$$H(Y|X) = -\left(1 - p + \frac{p}{d}\right)\log_2\left(1 - p + \frac{p}{d}\right) - (d-1)\frac{p}{d}\log_2\left(\frac{p}{d}\right). \tag{19.91}$$

Conclude that the classical capacity of the induced channel $p_{Y|X}(y|x)$ is the same as that for
the quantum depolarizing channel.)
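
The claim in the exercise can be sanity-checked numerically before attempting a proof. The sketch below builds the transition matrix in (19.90) and computes $I(X;Y) = H(Y) - H(Y|X)$ for a uniform input directly from the definitions; it is only a numerical check, not a solution of the exercise. The function names are assumptions introduced for illustration.

```python
import math


def induced_channel(p: float, d: int) -> list:
    """Transition matrix W[x][y] = (1 - p) * delta_{x,y} + p / d from (19.90)."""
    return [[(1.0 - p) * (1.0 if x == y else 0.0) + p / d for y in range(d)]
            for x in range(d)]


def entropy(dist) -> float:
    """Shannon entropy in bits, with the convention 0 * log2(0) = 0."""
    return -sum(q * math.log2(q) for q in dist if q > 0)


def mutual_information_uniform_input(p: float, d: int) -> float:
    """I(X;Y) = H(Y) - H(Y|X) for the induced channel with uniform p_X."""
    w = induced_channel(p, d)
    # Output distribution p_Y(y) = sum_x p_X(x) W[x][y] with p_X(x) = 1/d.
    p_y = [sum(w[x][y] for x in range(d)) / d for y in range(d)]
    # H(Y|X) is the average of the row entropies under the uniform input.
    h_y_given_x = sum(entropy(w[x]) for x in range(d)) / d
    return entropy(p_y) - h_y_given_x
```

Comparing the returned value with the closed-form expression in (19.68) for a few values of $p$ and $d$ shows agreement up to floating-point rounding, which is consistent with the statement of the exercise.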

Exercise 19.4.4 A covariant channel $\mathcal{N}_C$ is one for which the state resulting from a unitary
$U$ acting on the input state before the channel occurs is equivalent to one where there is an
irreducible representation $R_U$ of the unitary acting on the output of the channel:

$$\mathcal{N}_C(U\rho U^\dagger) = R_U \mathcal{N}_C(\rho) R_U^\dagger. \tag{19.92}$$

Show that the Holevo information $\chi(\mathcal{N}_C)$ of a covariant channel is equal to

$$\chi(\mathcal{N}_C) = \log_2 d - H(\mathcal{N}_C(\psi)), \tag{19.93}$$

where $\psi$ is an arbitrary pure state.

Exercise 19.4.5 Compute the classical capacity of the quantum erasure channel. First show
that it is single-letter. Then show that the classical capacity is equal to $1 - \varepsilon$.

19.5 Superadditivity of the Holevo Information

Many researchers thought for some time that the Holevo information would be additive for all 
quantum channels, implying that it would be a good characterization of the classical capacity 
in the general case — this conjecture was known as the additivity conjecture. Researchers 
thought that this conjecture would hold because they discovered a few channels for which it 
did hold, but without any common theme occurring in the proofs for the different channels, 
they soon began looking in the other direction for a counterexample to disprove it. After 
some time, Hastings found the existence of a counterexample to the additivity conjecture, 
demonstrating that it cannot hold in the general case. This result demonstrates that even 
one of the most basic questions in quantum Shannon theory still remains wide open and that 
entanglement at the encoder can help increase classical communication rates over a quantum 
channel. 

We first review a relation between the Holevo information and the minimum output
entropy of a tensor product channel. Suppose that we have two channels $\mathcal{N}_1$ and $\mathcal{N}_2$. The
Holevo information of the tensor-product channel is additive if

$$\chi(\mathcal{N}_1 \otimes \mathcal{N}_2) = \chi(\mathcal{N}_1) + \chi(\mathcal{N}_2). \tag{19.94}$$

Since the Holevo information is always superadditive for any two channels:

$$\chi(\mathcal{N}_1 \otimes \mathcal{N}_2) \geq \chi(\mathcal{N}_1) + \chi(\mathcal{N}_2), \tag{19.95}$$

(recall the statement at the beginning of the proof of Theorem 12.3.1), we say that it is
non-additive if it is strictly superadditive:

$$\chi(\mathcal{N}_1 \otimes \mathcal{N}_2) > \chi(\mathcal{N}_1) + \chi(\mathcal{N}_2). \tag{19.96}$$

The minimum output entropy $H_{\min}(\mathcal{N}_1 \otimes \mathcal{N}_2)$ of the tensor-product channel is a quantity
related to the Holevo information (see Definition 19.4.1). It is additive if

$$H_{\min}(\mathcal{N}_1 \otimes \mathcal{N}_2) = H_{\min}(\mathcal{N}_1) + H_{\min}(\mathcal{N}_2). \tag{19.97}$$

Since the minimum output entropy is always subadditive:

$$H_{\min}(\mathcal{N}_1 \otimes \mathcal{N}_2) \leq H_{\min}(\mathcal{N}_1) + H_{\min}(\mathcal{N}_2), \tag{19.98}$$

we say that it is non-additive if it is strictly subadditive:

$$H_{\min}(\mathcal{N}_1 \otimes \mathcal{N}_2) < H_{\min}(\mathcal{N}_1) + H_{\min}(\mathcal{N}_2). \tag{19.99}$$

Additivity of these two quantities is in fact related: it is possible to show that additivity
of the Holevo information implies additivity of the minimum output entropy and vice versa
(we leave one of these implications as an exercise). Thus, researchers considered additivity
of minimum output entropy rather than additivity of Holevo information because it is a
simpler quantity to manipulate.
Exercise 19.5.1 Prove that non-additivity of the minimum output entropy implies
non-additivity of the Holevo information:

$$H_{\min}(\mathcal{N}_1 \otimes \mathcal{N}_2) < H_{\min}(\mathcal{N}_1) + H_{\min}(\mathcal{N}_2) \;\Rightarrow\; \chi(\mathcal{N}_1 \otimes \mathcal{N}_2) > \chi(\mathcal{N}_1) + \chi(\mathcal{N}_2). \tag{19.100}$$

(Hint: Consider an augmented version $\mathcal{N}_i'$ of each channel $\mathcal{N}_i$, that has its first input be
the same as the input to $\mathcal{N}_i$ and its second input be a control input, and the action of the
channel is equivalent to measuring the auxiliary input $\sigma$ and applying a generalized Pauli
operator:

$$\mathcal{N}_i'(\rho \otimes \sigma) \equiv \sum_{k,l} X(k)Z(l)\,\mathcal{N}_i(\rho)\,Z^\dagger(l)X^\dagger(k)\,\langle k|\langle l|\sigma|k\rangle|l\rangle. \tag{19.101}$$

What is the Holevo information of the augmented channel $\mathcal{N}_i'$? What is the Holevo
information of the tensor product of the augmented channels $\mathcal{N}_1' \otimes \mathcal{N}_2'$?) After proving the above
statement, we can also conclude that additivity of the Holevo information implies additivity
of the minimum output entropy.

We briefly overview the main ideas behind the construction of a channel for which the
Holevo information is not additive. Consider a random-unitary channel of the following
form:

$$\mathcal{E}(\rho) \equiv \sum_{i=1}^{D} p_i U_i \rho U_i^\dagger, \tag{19.102}$$

where the dimension of the input state is $N$ and the number of random unitaries is $D$. This
channel is "random-unitary" because it applies a particular unitary $U_i$ with probability $p_i$ to
the state $\rho$. The cleverness behind the construction is not actually to provide a deterministic
instance of this channel, but rather, to provide a random instance of the channel where
both the distribution and the unitaries are chosen at random, and the dimension $N$ and the
number $D$ of chosen unitaries satisfy the following relationships:

$$1 \ll D \ll N. \tag{19.103}$$

The other channel to consider to disprove additivity is the conjugate channel

$$\bar{\mathcal{E}}(\rho) \equiv \sum_{i=1}^{D} p_i \bar{U}_i \rho \bar{U}_i^\dagger, \tag{19.104}$$

where $p_i$ and $U_i$ are the same respective probability distribution and unitaries from the
channel $\mathcal{E}$, and $\bar{U}_i$ denotes the complex conjugate of $U_i$. The goal is then to show that there
is a non-zero probability over all channels of these forms that the minimum output entropy
is non-additive:

$$H_{\min}(\mathcal{E} \otimes \bar{\mathcal{E}}) < H_{\min}(\mathcal{E}) + H_{\min}(\bar{\mathcal{E}}). \tag{19.105}$$
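A good candidate for a state that could saturate the minimum output entropy $H_{\min}(\mathcal{E} \otimes \bar{\mathcal{E}})$
of the tensor-product channel is the maximally entangled state $|\Phi\rangle$, where

$$|\Phi\rangle \equiv \frac{1}{\sqrt{N}}\sum_{i=0}^{N-1} |i\rangle|i\rangle. \tag{19.106}$$

Consider the effect of the tensor-product channel $\mathcal{E} \otimes \bar{\mathcal{E}}$ on the maximally entangled state $\Phi$:

\begin{align}
(\mathcal{E} \otimes \bar{\mathcal{E}})(\Phi) &= \sum_{i,j} p_i p_j (U_i \otimes \bar{U}_j)\,\Phi\,(U_i^\dagger \otimes \bar{U}_j^\dagger) \tag{19.107} \\
&= \sum_{i=j} p_i^2 (U_i \otimes \bar{U}_i)\,\Phi\,(U_i^\dagger \otimes \bar{U}_i^\dagger) + \sum_{i \neq j} p_i p_j (U_i \otimes \bar{U}_j)\,\Phi\,(U_i^\dagger \otimes \bar{U}_j^\dagger) \tag{19.108} \\
&= \left(\sum_i p_i^2\right)\Phi + \sum_{i \neq j} p_i p_j (U_i \otimes \bar{U}_j)\,\Phi\,(U_i^\dagger \otimes \bar{U}_j^\dagger), \tag{19.109}
\end{align}

where the last line uses the fact that $(M \otimes I)|\Phi\rangle = (I \otimes M^T)|\Phi\rangle$ for any operator $M$ (this
implies that $(U \otimes \bar{U})|\Phi\rangle = |\Phi\rangle$). When comparing the above state to one resulting from
inputting a product state to the channel, there is a sense in which the above state is less
noisy than the product state because $D$ of the combinations of the random unitaries (the
ones which have the same index) have no effect on the maximally entangled state. Using
techniques from Ref. [125], we can make this intuition precise and obtain the following upper
bound on the minimum output entropy:

\begin{align}
H_{\min}(\mathcal{E} \otimes \bar{\mathcal{E}}) &\leq H((\mathcal{E} \otimes \bar{\mathcal{E}})(\Phi)) \tag{19.110} \\
&\leq 2\ln D - \frac{\ln D}{D}, \tag{19.111}
\end{align}

for $N$ and $D$ large enough. However, using techniques in the same paper, we can also show
that

$$H_{\min}(\mathcal{E}) \geq \ln D - \delta S_{\max}, \tag{19.112}$$

where

$$\delta S_{\max} \equiv \frac{c}{D} + \mathrm{poly}(D)\,O\!\left(\sqrt{\frac{\ln N}{N}}\right), \tag{19.113}$$

$c$ is a constant, and $\mathrm{poly}(D)$ indicates a term polynomial in $D$. Thus, for large enough $D$
and $N$, it follows that

$$2\delta S_{\max} < \frac{\ln D}{D}, \tag{19.114}$$

and we get the existence of a channel for which a violation of additivity occurs, because

\begin{align}
H_{\min}(\mathcal{E} \otimes \bar{\mathcal{E}}) &\leq 2\ln D - \frac{\ln D}{D} \tag{19.115} \\
&< 2\ln D - 2\delta S_{\max} \tag{19.116} \\
&\leq H_{\min}(\mathcal{E}) + H_{\min}(\bar{\mathcal{E}}). \tag{19.117}
\end{align}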
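
The identity $(U \otimes \bar{U})|\Phi\rangle = |\Phi\rangle$ used in (19.109) is easy to check numerically. The following minimal sketch uses only the Python standard library; the helper names and the particular $2 \times 2$ unitary are assumptions chosen for illustration and have nothing to do with the random unitaries of the construction itself.

```python
import math


def kron(a, b):
    """Kronecker product of two square matrices given as nested lists."""
    m = len(b)
    size = len(a) * m
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(size)]
            for i in range(size)]


def conjugate(mat):
    """Entrywise complex conjugate (no transpose)."""
    return [[complex(z).conjugate() for z in row] for row in mat]


def overlap_with_phi(u, v):
    """Return <Phi|(u ⊗ v)|Phi> for the maximally entangled state |Phi>."""
    n = len(u)
    # |Phi> = (1/sqrt(n)) sum_i |i>|i>, with amplitudes indexed by a * n + b.
    phi = [1.0 / math.sqrt(n) if i // n == i % n else 0.0 for i in range(n * n)]
    big = kron(u, v)
    out = [sum(big[r][c] * phi[c] for c in range(n * n)) for r in range(n * n)]
    # phi has real amplitudes, so no conjugation is needed on the bra side.
    return sum(phi[r] * out[r] for r in range(n * n))


# Hypothetical example unitary (assumption for illustration only).
s = 1.0 / math.sqrt(2.0)
U = [[s, 1j * s], [1j * s, s]]
```

For this example, `overlap_with_phi(U, conjugate(U))` evaluates to one up to rounding, so the state is left invariant, whereas `overlap_with_phi(U, U)` evaluates to zero, illustrating why the conjugate channel is paired with the original one.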



19.6 Concluding Remarks 



The Holevo-Schumacher-Westmoreland (HSW) theorem offers a good characterization of the
classical capacity of certain classes of channels, but at the same time, it also demonstrates 
our lack of understanding of classical transmission over general quantum channels. To be 
more precise, the Holevo information is a useful characterization of the classical capacity of 
a quantum channel whenever it is additive, but the regularized Holevo information is not 
particularly useful as a characterization of it because we cannot even compute this quantity. 
This suggests that there could be some other formula that better characterizes the classical 
capacity (if such a formula were additive). As of the writing of this book, such a formula is 
unknown. 

Despite the drawbacks of the HSW theorem, it is still interesting because it at least offers 
a step beyond the most naive characterization of the classical capacity of a quantum channel 
with the regularized accessible information. The major insight of HSW was the construction 
of an explicit POVM (corresponding to a random choice of code) that allows the sender 
and receiver to communicate at a rate equal to the Holevo information of the channel. This 
theorem is also useful for determining achievable rates in different communication scenar- 
ios: for example, when two senders are trying to communicate over a noisy medium to a 
single receiver and when a single sender is trying to transmit both classical and quantum 
information to a receiver. 

The depolarizing channel is an example of a quantum channel for which there is a simple 
expression for its classical capacity. Furthermore, the expression reveals that the scheme 
needed to achieve the capacity of the channel is rather classical — it is only necessary for the 
sender to select codewords uniformly at random from some orthonormal basis, and it is only 
necessary for the receiver to perform measurements of the individual channel outputs in the 
same orthonormal basis. Thus, the coding scheme is classical because entanglement plays no 
role at the encoder and the decoding measurements act on the individual channel outputs. 

Finally, we discussed Hastings' construction of a quantum channel for which the heralded
additivity conjecture does not hold. That is, there exists a channel where entanglement at
the encoder can improve communication rates. This superadditive effect is a uniquely
quantum phenomenon (recall that Theorem 12.1.1 states that the classical mutual information of
a classical channel is additive, and thus correlations at the input cannot increase capacity).
This result implies that our best known characterization of the classical capacity of a
quantum channel in terms of the channel's Holevo information is far from being a satisfactory
characterization of the true capacity, and we still have much more to discover here.
19.7 History and Further Reading

Holevo was the first to prove the bound bearing his name, regarding the transmission of
classical information with a noisy quantum channel [142], and Holevo [144] and Schumacher and
Westmoreland [219] many years later proved that the Holevo information is an achievable
rate for classical data transmission. Just prior to these works, Hausladen et al. proved
achievability of the Holevo information for the special case of a channel that accepts a
classical input and outputs a pure state conditional on the input [126]. They also published
a preliminary article [127] in which they answered the catchy question (for the special case
of pure states), "How many bits can you fit into a quantum-mechanical it?"

King first proved additivity of the Holevo information for unital qubit channels [170] and
later showed it for the depolarizing channel [171]. Shor later showed the equivalence of several
additivity conjectures [228] (that they are either all true or all false). Hayden [131], Winter
[258], and a joint paper between them [136] proved some results leading up to the work of
Hastings [125], who demonstrated a counterexample to the additivity conjecture. Thus, by
Shor's aforementioned paper, all of the additivity conjectures are false in general. There has
been much follow-up work in an attempt to understand Hastings' result [99l HTj |98j [TT] . 

Some other papers have tried to understand the HSW coding theorem from the perspective
of hypothesis testing. Hayashi began much of this work [130], and he covers quite a bit
of quantum hypothesis testing in his book [128]. Datta and Mosonyi followed up with some
work along these lines [192], as did Renner and Wang [244].

CHAPTER 20

Entanglement-Assisted Classical Communication


We have learned that shared entanglement is often helpful in quantum communication. This
is certainly true for the case of a noiseless qubit channel. Without shared entanglement, the
most classical information that a sender can reliably transmit over a noiseless qubit channel
is just one classical bit (recall Exercise 4.2.2 and the Holevo bound in Exercise 11.9.1).
With shared entanglement, they can achieve the super-dense coding resource inequality from
Chapter 7:

$$[q \to q] + [qq] \geq 2[c \to c]. \tag{20.1}$$

That is, with one noiseless qubit channel and one shared noiseless ebit, the sender can reliably
transmit two classical bits.

A natural question then for us to consider is whether shared entanglement could be helpful
in transmitting classical information over a noisy quantum channel $\mathcal{N}$. As a first simplifying
assumption, we let Alice and Bob have access to an infinite supply of entanglement, in
whatever form they wish, and we would like to know how much classical information Alice
can reliably transmit to Bob over such an entanglement-assisted quantum channel. That is,
we would like to determine the highest achievable rate $C$ of classical communication in the
following resource inequality:

$$\langle\mathcal{N}\rangle + \infty[qq] \geq C[c \to c]. \tag{20.2}$$

The answer to this question is one of the strongest known results in quantum Shannon
theory, and it is given by the entanglement-assisted classical capacity theorem. This
theorem states that the mutual information $I(\mathcal{N})$ of a quantum channel $\mathcal{N}$ is equal to its
entanglement-assisted classical capacity, where

$$I(\mathcal{N}) \equiv \max_{\phi^{AA'}} I(A;B)_\rho, \tag{20.3}$$

$\rho^{AB} \equiv \mathcal{N}^{A' \to B}(\phi^{AA'})$, and the maximization is over all pure bipartite states of the form $\phi^{AA'}$.
We should stress that there is no need to regularize this formula in order to characterize the
capacity (as done in the previous chapter and as is so often needed in quantum Shannon
theory). The value of this formula is the capacity. Also, the optimization task that the
formula in (20.3) sets out is a straightforward convex optimization program. Any local
maximum is a global maximum because the quantum mutual information is concave in the
input state $\phi^{A'}$ (recall Theorem 12.4.2 from Chapter 12) and the set of density operators is
convex.

From the perspective of an information theorist, we should only say that a capacity
theorem has been solved if there is a tractable formula equal to the optimal rate for achieving
a particular operational task. The formula should apply to an arbitrary quantum channel,
and it should be a function of that channel. Otherwise, the capacity theorem is still unsolved.
There are several operative words in the above sentences that we should explain in more
detail. The formula should be tractable, meaning that it sets out an optimization task which
is efficient to solve in the dimension of the channel's input system. The formula should give
the optimal achievable rate for the given information processing task, meaning that if a rate
exceeds the capacity of the channel, then the probability of error for any such protocol should
be bounded away from zero as the number of channel uses grows large.[1] Finally, perhaps the
most stringent (though related) criterion is that the formula itself (and not its regularization)
should give the capacity of an arbitrary quantum channel. Despite the success of the HSW
coding theorem in demonstrating that the Holevo information of a channel is an achievable
rate for classical communication, the classical capacity of a quantum channel is still unsolved
because there is an example of a channel for which the Holevo information is not equal to
that channel's capacity (see Section 19.5). Thus, it is rather impressive that the formula in
(20.3) is equal to the entanglement-assisted classical capacity of an arbitrary channel, given
the stringent requirements that we have established for a formula to give the capacity. In
this sense, shared entanglement simplifies quantum Shannon theory.

This chapter presents a comprehensive study of the entanglement-assisted classical
capacity theorem. We begin by defining the information processing task, consisting of all the
steps in a general protocol for classical communication over an entanglement-assisted
quantum channel. We then present a simple example of a strategy for entanglement-assisted
classical coding that is inspired by dense coding, and in turn, that inspires a strategy for
the general case. Section 20.3 states the entanglement-assisted classical capacity theorem.
Section 20.4 gives a proof of the direct coding theorem, making use of quantum typicality
from Chapter 14, the Packing Lemma from Chapter 15, and ideas in the entanglement
concentration protocol from Chapter 18. It demonstrates that the rate in (20.3) is an achievable
rate for entanglement-assisted classical communication. After taking a step back from the
protocol, we can realize that it is merely a glorified super-dense coding applied to noisy
quantum channels. Section 20.5 gives a proof of the converse of the entanglement-assisted
classical capacity theorem. It exploits familiar tools such as the Alicki-Fannes' inequality,
the quantum data processing inequality, and the chain rule for quantum mutual information
(all from Chapter 11), and the last part of it exploits additivity of the mutual information
of a quantum channel (from Chapter 12). The converse theorem establishes that the rate
in (20.3) is optimal. With the proof of the capacity theorem complete, we then show the
interesting result that the classical capacity of a quantum channel assisted by a quantum
feedback channel is equal to the entanglement-assisted classical capacity of that channel.
We close the chapter by computing the entanglement-assisted classical capacity of both a
quantum erasure channel and an amplitude damping channel, and we leave the computation
of the entanglement-assisted capacities of two other channels as exercises.

[1] We could strengthen this requirement even more by demanding that the probability of error increases
exponentially to one in the asymptotic limit. Fulfilling such a demand would constitute a proof of a strong
converse theorem.

20.1 The Information Processing Task 

We begin by explicitly defining the information processing task of entanglement-assisted
classical communication, i.e., we define an $(n, C - \delta, \epsilon)$ entanglement-assisted classical code
and what it means for a rate $C$ to be achievable. Prior to the start of the protocol, we
assume that Alice and Bob share pure-state entanglement in whatever form they wish. For
simplicity, we just assume that they share a maximally entangled state of the following form:

$$|\Phi\rangle^{T_AT_B} \equiv \frac{1}{\sqrt{d}}\sum_{i=0}^{d-1} |i\rangle^{T_A}|i\rangle^{T_B}, \tag{20.4}$$

where the dimension $d$ is as large as they would like it to be. Alice selects some message
$m$ uniformly at random from a set $\mathcal{M}$ of messages. Let $M$ denote the random variable
corresponding to Alice's random choice of message, and let $|\mathcal{M}|$ be the cardinality of the
set $\mathcal{M}$. She applies some CPTP encoding map $\mathcal{E}_m^{T_A \to A'^n}$ to her half of the entangled state
$\Phi^{T_AT_B}$ depending on her choice of message $m$. The global state then becomes

$$\mathcal{E}_m^{T_A \to A'^n}(\Phi^{T_AT_B}). \tag{20.5}$$

Alice transmits the systems $A'^n$ over $n$ independent uses of a noisy channel $\mathcal{N}^{A' \to B}$, leading
to the following state:

$$\mathcal{N}^{A'^n \to B^n}(\mathcal{E}_m^{T_A \to A'^n}(\Phi^{T_AT_B})), \tag{20.6}$$

where $\mathcal{N}^{A'^n \to B^n} \equiv (\mathcal{N}^{A' \to B})^{\otimes n}$. Bob receives the systems $B^n$, combines them with his share
$T_B$ of the entanglement, and performs a POVM $\{\Lambda_m^{B^nT_B}\}$ on the channel outputs $B^n$ and his
share $T_B$ of the entanglement in order to detect the message $m$ that Alice transmits.
Figure 20.1 depicts such a general protocol for entanglement-assisted classical communication.
Let $M'$ denote the random variable for the output of Bob's decoding POVM (this
represents Bob's estimate of the message). The probability of Bob correctly decoding Alice's
message is

$$\Pr\{M' = m \mid M = m\} = \mathrm{Tr}\{\Lambda_m^{B^nT_B}\,\mathcal{N}^{A'^n \to B^n}(\mathcal{E}_m^{T_A \to A'^n}(\Phi^{T_AT_B}))\}, \tag{20.7}$$

and thus the probability of error $p_e(m)$ for message $m$ is

$$p_e(m) \equiv \mathrm{Tr}\{(I - \Lambda_m^{B^nT_B})\,\mathcal{N}^{A'^n \to B^n}(\mathcal{E}_m^{T_A \to A'^n}(\Phi^{T_AT_B}))\}. \tag{20.8}$$

Figure 20.1: The most general protocol for entanglement-assisted classical communication. Alice applies
an encoder to her classical message $M$ and her share $T_A$ of the entanglement, and she inputs the encoded
systems $A'^n$ to many uses of the channel. Bob receives the outputs of the channel, combines them with his
share of the entanglement, and performs some decoding operation to estimate Alice's transmitted message.

The maximal probability of error $p_e^*$ for the coding scheme is

$$p_e^* \equiv \max_{m \in \mathcal{M}} p_e(m). \tag{20.9}$$

The rate $C$ of communication is

$$C \equiv \frac{1}{n}\log_2|\mathcal{M}| + \delta, \tag{20.10}$$

where $\delta$ is an arbitrarily small positive number, and the code has $\epsilon$ error if $p_e^* \leq \epsilon$. A rate $C$
of entanglement-assisted classical communication is achievable if there exists an $(n, C - \delta, \epsilon)$
entanglement-assisted classical code for all $\delta, \epsilon > 0$ and sufficiently large $n$.

20.2 A Preliminary Example 



Let us first recall a few items about qudits. The maximally entangled qudit state is

$$|\Phi\rangle^{AB} \equiv \frac{1}{\sqrt{d}}\sum_{i=0}^{d-1} |i\rangle^{A}|i\rangle^{B}. \tag{20.11}$$

Recall from Section 3.6.2 that the Heisenberg-Weyl operators $X(x)$ and $Z(z)$ are an extension
of the Pauli matrices to $d$ dimensions:

$$X(x) \equiv \sum_{x'=0}^{d-1} |x + x'\rangle\langle x'|, \qquad Z(z) \equiv \sum_{z'=0}^{d-1} e^{2\pi i z z'/d}|z'\rangle\langle z'|. \tag{20.12}$$

Figure 20.2: A simple scheme, inspired by super-dense coding, for Alice and Bob to exploit shared
entanglement and a noisy channel in order to establish an ensemble at Bob's receiving end.

Let $|\Phi_{x,z}\rangle^{AB}$ denote the state that results when Alice applies the operator $X(x)Z(z)$ to her
share of the maximally entangled state $|\Phi\rangle^{AB}$:

$$|\Phi_{x,z}\rangle^{AB} \equiv (X^A(x)Z^A(z) \otimes I^B)|\Phi\rangle^{AB}. \tag{20.13}$$

Recall from Exercise 3.6.11 that the set of states $\{|\Phi_{x,z}\rangle^{AB}\}_{x,z=0}^{d-1}$ forms a complete
orthonormal basis:

$$\langle\Phi_{x',z'}|\Phi_{x,z}\rangle = \delta_{x,x'}\delta_{z,z'}, \qquad \sum_{x,z=0}^{d-1} |\Phi_{x,z}\rangle\langle\Phi_{x,z}| = I^{AB}. \tag{20.14}$$

Let $\pi^{AB}$ denote the maximally mixed state on Alice and Bob's system: $\pi^{AB} \equiv I^{AB}/d^2$, and
let $\pi^A$ and $\pi^B$ denote the respective maximally mixed states on Alice and Bob's systems:
$\pi^A \equiv I^A/d$ and $\pi^B \equiv I^B/d$. Observe that $\pi^{AB} = \pi^A \otimes \pi^B$.
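
The orthonormality and completeness relations in (20.14) can be checked numerically for small $d$. The sketch below is only such a check, not a proof; it uses the fact that $X(x)Z(z)|i\rangle = e^{2\pi i z i/d}|i + x\rangle$, with the addition taken modulo $d$. The function names and the amplitude indexing convention are assumptions introduced for illustration.

```python
import cmath
import math


def bell_state(x: int, z: int, d: int) -> list:
    """Amplitudes of |Phi_{x,z}> = (X(x)Z(z) ⊗ I)|Phi>, indexed by a * d + b."""
    vec = [0j] * (d * d)
    for i in range(d):
        # X(x)Z(z)|i> = exp(2*pi*i*z*i/d) |i + x mod d> on Alice's system.
        phase = cmath.exp(2j * cmath.pi * z * i / d)
        vec[((i + x) % d) * d + i] = phase / math.sqrt(d)
    return vec


def inner(u, v) -> complex:
    """Inner product <u|v>, conjugating the first argument."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


def sum_of_projectors(d: int) -> list:
    """Matrix sum over x, z of |Phi_{x,z}><Phi_{x,z}|, as nested lists."""
    n = d * d
    total = [[0j] * n for _ in range(n)]
    for x in range(d):
        for z in range(d):
            v = bell_state(x, z, d)
            for r in range(n):
                for c in range(n):
                    total[r][c] += v[r] * v[c].conjugate()
    return total
```

For $d = 3$, the inner products of all pairs of the nine states reproduce the Kronecker deltas in (20.14) up to rounding, and `sum_of_projectors(3)` agrees with the $9 \times 9$ identity matrix. Dividing that sum by $d^2$ gives the maximally mixed state $\pi^{AB}$.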

We now consider a simple strategy, inspired by super-dense coding and the HSW coding 



scheme from Theorem 19.3.1, that Alice and Bob can employ for entanglement-assisted 
classical communication. That is, we show how a strategy similar to super-dense coding 
induces a particular ensemble at Bob's receiving end, to which we can then apply the HSW 
coding theorem in order to establish the existence of a good code for entanglement-assisted 
classical communication. Suppose that Alice and Bob possess a maximally entangled qudit 
state |$) . Alice chooses two symbols x and z uniformly at random, each in {0, . . . , d — 1}. 
She applies the operators X(x)Z(z) to her side of the maximally entangled state |$) , and 
the resulting state is \<& x ,z) ■ She then sends her system A over the noisy channel Af A ^ B , 
and Bob receives the output B' from the channel. The noisy channel on the whole system 
is J\f A ^ B <8> I B ', and the ensemble that Bob receives is as follows: 



1 



N 



A^B' 



i B m B z) 



(20.15) 



This constitutes an ensemble that they can prepare with one use of the channel and one 



shared entangled state (Figure 20.2 depicts all of these steps). But, in general, we allow 



them to exploit many uses of the channel and however much entanglement that they need. 
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Bob can then perform a collective measurement on both his half of the entanglement and 
the channel outputs in order to determine a message that Alice is transmitting. 



Consider that the above scenario is similar to HSW coding. Theorem 19.3.1 from the
previous chapter proves that the Holevo information of the above ensemble is an achievable
rate for classical communication over this entanglement-assisted quantum channel. Thus, we
can already state and prove the following corollary of Theorem 19.3.1, simply by calculating
the Holevo information of the ensemble in (20.15).



Corollary 20.2.1. The quantum mutual information $I(A;B)_{\sigma}$ of the state
$\sigma^{AB} \equiv \mathcal{N}^{A'\to B}(\Phi^{AA'})$ is an achievable rate for entanglement-assisted classical
communication over a quantum channel $\mathcal{N}^{A'\to B}$.



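Proof. Observe that we can map the ensemble in (20.15) to the following classical-quantum
state:

\begin{equation}
\rho^{XZB'B} \equiv \sum_{x,z=0}^{d-1} \frac{1}{d^{2}}\, |x\rangle\langle x|^{X} \otimes |z\rangle\langle z|^{Z} \otimes \left(\mathcal{N}^{A\to B'} \otimes I^{B}\right)\left(\Phi_{x,z}^{AB}\right).
\tag{20.16}
\end{equation}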



The Holevo information of this classical-quantum state is

\begin{equation}
I(XZ;B'B)_{\rho} = H(B'B)_{\rho} - H(B'B|XZ)_{\rho},
\tag{20.17}
\end{equation}

and it is an achievable rate for entanglement-assisted classical communication over the
channel $\mathcal{N}^{A\to B'}$ by Theorem 19.3.1. We now proceed to calculate it. First, we determine the
entropy $H(B'B)_{\rho}$ by tracing over the classical registers $XZ$:



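\begin{align}
\mathrm{Tr}_{XZ}\left\{\rho^{XZB'B}\right\}
&= \sum_{x,z=0}^{d-1} \frac{1}{d^{2}} \left(\mathcal{N}^{A\to B'} \otimes I^{B}\right)\left(\Phi_{x,z}^{AB}\right) \tag{20.18}\\
&= \left(\mathcal{N}^{A\to B'} \otimes I^{B}\right)\left(\frac{1}{d^{2}}\sum_{x,z=0}^{d-1} \Phi_{x,z}^{AB}\right) \tag{20.19}\\
&= \left(\mathcal{N}^{A\to B'} \otimes I^{B}\right)\left(\pi^{AB}\right) \tag{20.20}\\
&= \mathcal{N}^{A\to B'}\left(\pi^{A}\right) \otimes \pi^{B}, \tag{20.21}
\end{align}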



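where the third equality follows from (20.14). Thus, the entropy $H(B'B)$ is as follows:

\begin{equation}
H(B'B) = H\left(\mathcal{N}^{A\to B'}(\pi^{A})\right) + H(\pi^{B}).
\tag{20.22}
\end{equation}

We now determine the conditional quantum entropy $H(B'B|XZ)$:

\begin{align}
H(B'B|XZ)_{\rho}
&= \frac{1}{d^{2}}\sum_{x,z=0}^{d-1} H\left(\left(\mathcal{N}^{A\to B'} \otimes I^{B}\right)\left(\Phi_{x,z}^{AB}\right)\right) \tag{20.23}\\
&= \frac{1}{d^{2}}\sum_{x,z=0}^{d-1} H\left(\mathcal{N}^{A\to B'}\left[\left(X^{A}(x)Z^{A}(z)\right)\left(\Phi^{AB}\right)\left(Z^{\dagger A}(z)X^{\dagger A}(x)\right)\right]\right) \tag{20.24}\\
&= \frac{1}{d^{2}}\sum_{x,z=0}^{d-1} H\left(\mathcal{N}^{A\to B'}\left[(Z^{T})^{B}(z)(X^{T})^{B}(x)\left(\Phi^{AB}\right)X^{*B}(x)Z^{*B}(z)\right]\right) \tag{20.25}\\
&= \frac{1}{d^{2}}\sum_{x,z=0}^{d-1} H\left((Z^{T})^{B}(z)(X^{T})^{B}(x)\left[\mathcal{N}^{A\to B'}\left(\Phi^{AB}\right)\right]X^{*B}(x)Z^{*B}(z)\right) \tag{20.26}\\
&= H\left(\mathcal{N}^{A\to B'}\left(\Phi^{AB}\right)\right). \tag{20.27}
\end{align}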



The first equality follows because the system $XZ$ is classical (recall the result in
Section 11.4.1). The second equality follows from the definition of the states $|\Phi_{x,z}\rangle^{AB}$.
The third equality follows by exploiting the Bell-state matrix identity in Exercise 3.6.12. The
fourth equality follows because the unitaries that Alice applies commute with the action of the
channel (after the third step they act only on Bob's system $B$, which the channel leaves
untouched). Finally, the entropy of a state is invariant under any unitaries applied to that state.
So the Holevo information $I(XZ;B'B)_{\rho}$ of the state $\rho^{XZB'B}$ in (20.16) is

\begin{equation}
I(XZ;B'B)_{\rho} = H\left(\mathcal{N}(\pi^{A})\right) + H(\pi^{B}) - H\left(\left(\mathcal{N}^{A\to B'} \otimes I^{B}\right)\left(\Phi^{AB}\right)\right).
\tag{20.28}
\end{equation}

Equivalently, we can write it as the following quantum mutual information:

\begin{equation}
I(A;B)_{\sigma},
\tag{20.29}
\end{equation}

with respect to the state $\sigma^{AB}$, where

\begin{equation}
\sigma^{AB} \equiv \mathcal{N}^{A'\to B}\left(\Phi^{AA'}\right).
\tag{20.30}
\end{equation}

□ 
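As a sanity check of the formula in (20.28), consider the special case of a noiseless qudit channel. This case is chosen here purely for illustration and is not worked in the text above. The output of the channel on $\pi^{A}$ is again maximally mixed, and the joint state remains the pure state $\Phi^{AB}$, so that:

```latex
\begin{align*}
\mathcal{N}^{A\to B'} = \mathrm{id}^{A\to B'}: \quad
I(XZ;B'B)_{\rho}
  &= H(\pi^{A}) + H(\pi^{B}) - H(\Phi^{AB}) \\
  &= \log d + \log d - 0 \\
  &= 2\log d .
\end{align*}
```

For $d = 2$ this gives two classical bits per channel use, which matches the rate of super-dense coding.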



For some channels, the quantum mutual information in Corollary 20.2.1 is equal to that
channel's entanglement-assisted classical capacity. This occurs for the depolarizing channel,
a dephasing channel, and an erasure channel, to name a few. But there are examples of
channels, such as the amplitude damping channel, where the quantum mutual information
in Corollary 20.2.1 is not equal to the entanglement-assisted capacity. In the general case,
it might perhaps be intuitive that the quantum mutual information of the channel in (20.3)
is equal to the entanglement-assisted capacity of the channel, and it is the goal of the next
sections to prove this result.

Exercise 20.2.1 Consider the following strategy for transmitting and detecting classical
information over an entanglement-assisted depolarizing channel. Alice selects a state
$|\Phi_{x,z}\rangle^{AB}$ uniformly at random and sends the $A$ system over the quantum depolarizing
channel $\mathcal{N}_{D}^{A\to B'}$, where

\begin{equation}
\mathcal{N}_{D}^{A\to B'}(\rho) \equiv (1-p)\rho + p\pi.
\tag{20.31}
\end{equation}

Bob receives the output $B'$ of the channel and combines it with his share $B$ of the
entanglement. He then performs a measurement of these systems in the Bell basis
$\{|\Phi_{x',z'}\rangle\langle\Phi_{x',z'}|^{B'B}\}$. Determine a simplified expression for the induced classical
channel $p_{Z'X'|ZX}(z',x'|z,x)$ where

\begin{equation}
p_{Z'X'|ZX}(z',x'|z,x) \equiv \langle\Phi_{x',z'}|\left(\mathcal{N}_{D}^{A\to B'} \otimes I^{B}\right)\left(|\Phi_{x,z}\rangle\langle\Phi_{x,z}|^{AB}\right)|\Phi_{x',z'}\rangle.
\tag{20.32}
\end{equation}

Show that the classical capacity of the channel $p_{Z'X'|ZX}(z',x'|z,x)$ is equal to the
entanglement-assisted classical capacity of the depolarizing channel (you can take it for granted
that the entanglement-assisted classical capacity of the depolarizing channel is given by
Corollary 20.2.1). Thus, there is no need for the receiver to perform a collective measurement on
many channel outputs in order to achieve capacity; it suffices to perform single-channel Bell
measurements at the receiver.



20.3 The Entanglement-Assisted Classical Capacity Theorem

We now state the entanglement-assisted classical capacity theorem. Section 20.4 proves the
direct part of this theorem, and Section 20.5 proves its converse part.



Theorem 20.3.1 (Bennett-Shor-Smolin-Thapliyal). The entanglement-assisted classical
capacity of a quantum channel is the supremum over all achievable rates for
entanglement-assisted classical communication, and it is equal to the channel's mutual information:

\begin{equation}
\sup\{C \mid C \text{ is achievable}\} = I(\mathcal{N}),
\tag{20.33}
\end{equation}

where the mutual information $I(\mathcal{N})$ of a channel $\mathcal{N}$ is defined as
$I(\mathcal{N}) \equiv \max_{\varphi^{AA'}} I(A;B)_{\rho}$, $\rho^{AB} \equiv \mathcal{N}^{A'\to B}(\varphi^{AA'})$, and $\varphi^{AA'}$ is a pure
bipartite state.

20.4 The Direct Coding Theorem 

The direct coding theorem is a statement of achievability: 

Theorem 20.4.1 (Direct Coding). The following resource inequality corresponds to an
achievable protocol for entanglement-assisted classical communication over a noisy quantum
channel:

\begin{equation}
\langle\mathcal{N}\rangle + H(A)_{\rho}\,[qq] \geq I(A;B)_{\rho}\,[c \to c],
\tag{20.34}
\end{equation}

where $\rho^{AB} \equiv \mathcal{N}^{A'\to B}(\varphi^{AA'})$.

We suppose that Alice and Bob share $n$ copies of an arbitrary pure, bipartite entangled
state $|\varphi\rangle^{AB}$. The amount of entanglement in this state is equivalent to $nH(A)$ ebits. We
would like to apply a similar coding technique as outlined in Section 20.2. For example, it
would be useful to exploit the transpose trick from Exercise 3.6.12, but we cannot do so
directly because this trick only applies to maximally entangled states. Though, we can instead
exploit the fact that Alice and Bob share many copies of the state $|\varphi\rangle^{AB}$ that decompose
into a direct sum of maximally entangled states. The development is similar to that which
we outlined for the entanglement concentration protocol from Chapter 18. First, recall that
every pure, bipartite state has a Schmidt decomposition (see Theorem 3.6.1):

\begin{equation}
|\varphi\rangle^{AB} \equiv \sum_{x} \sqrt{p_{X}(x)}\, |x\rangle^{A}|x\rangle^{B},
\tag{20.35}
\end{equation}

where $p_{X}(x) \geq 0$, $\sum_{x} p_{X}(x) = 1$, and $\{|x\rangle^{A}\}$ and $\{|x\rangle^{B}\}$ are orthonormal bases for
Alice and Bob's respective systems. Let us take $n$ copies of the above state, giving a state of the
following form:

\begin{equation}
|\varphi\rangle^{A^{n}B^{n}} \equiv \sum_{x^{n}} \sqrt{p_{X^{n}}(x^{n})}\, |x^{n}\rangle^{A^{n}}|x^{n}\rangle^{B^{n}},
\tag{20.36}
\end{equation}

where

\begin{align}
x^{n} &\equiv x_{1}\cdots x_{n}, \tag{20.37}\\
p_{X^{n}}(x^{n}) &\equiv p_{X}(x_{1})\cdots p_{X}(x_{n}), \tag{20.38}\\
|x^{n}\rangle &\equiv |x_{1}\rangle\cdots|x_{n}\rangle. \tag{20.39}
\end{align}

We can write the above state in terms of its type decomposition (just as we did in
(18.29)-(18.32) for the entanglement concentration protocol):

\begin{align}
|\varphi\rangle^{A^{n}B^{n}}
&= \sum_{t}\sum_{x^{n}\in T_{t}} \sqrt{p_{X^{n}}(x^{n})}\, |x^{n}\rangle^{A^{n}}|x^{n}\rangle^{B^{n}} \tag{20.40}\\
&= \sum_{t} \sqrt{p_{X^{n}}(x_{t}^{n})} \sum_{x^{n}\in T_{t}} |x^{n}\rangle^{A^{n}}|x^{n}\rangle^{B^{n}} \tag{20.41}\\
&= \sum_{t} \sqrt{p_{X^{n}}(x_{t}^{n})\, d_{t}}\; \frac{1}{\sqrt{d_{t}}} \sum_{x^{n}\in T_{t}} |x^{n}\rangle^{A^{n}}|x^{n}\rangle^{B^{n}} \tag{20.42}\\
&= \sum_{t} \sqrt{p(t)}\, |\Phi_{t}\rangle^{A^{n}B^{n}}, \tag{20.43}
\end{align}

with the following definitions:

\begin{align}
p(t) &\equiv p_{X^{n}}(x_{t}^{n})\, d_{t}, \tag{20.44}\\
|\Phi_{t}\rangle^{A^{n}B^{n}} &\equiv \frac{1}{\sqrt{d_{t}}} \sum_{x^{n}\in T_{t}} |x^{n}\rangle^{A^{n}}|x^{n}\rangle^{B^{n}}. \tag{20.45}
\end{align}

We point the reader to (18.29)-(18.32) for explanations of these equalities.
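The type decomposition can be made concrete for a small alphabet. The following is a minimal sketch, assuming a type $t$ is represented by its vector of symbol counts, so that $d_{t}$ is a multinomial coefficient and $p(t) = p_{X^{n}}(x_{t}^{n})\,d_{t}$ as in (20.44); the function names are illustrative and not from the text. It checks that $p(t)$ is a probability distribution over types.

```python
from math import factorial, prod


def types(n, k):
    """Yield all count vectors (n_1, ..., n_k) of nonnegative integers summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in types(n - first, k - 1):
            yield (first,) + rest


def type_class_size(counts):
    """d_t: number of length-n sequences with the given symbol counts."""
    size = factorial(sum(counts))
    for c in counts:
        size //= factorial(c)  # exact: every partial quotient is an integer
    return size


def type_distribution(p, n):
    """Map each type t (a count vector) to p(t) = p_{X^n}(x_t^n) * d_t."""
    dist = {}
    for counts in types(n, len(p)):
        # Every sequence in a type class has the same probability.
        seq_prob = prod(px ** c for px, c in zip(p, counts))
        dist[counts] = seq_prob * type_class_size(counts)
    return dist
```

For example, `type_distribution([0.5, 0.3, 0.2], 4)` returns one entry per type, and the values sum to one up to floating-point error, reflecting the normalization of the state in (20.43).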

Each state $|\Phi_{t}\rangle^{A^{n}B^{n}}$ is maximally entangled with Schmidt rank $d_{t}$, and we can thus apply
the transpose trick for operators acting on the type class subspaces. Inspired by the
dense-coding-like strategy from Section 20.2, we allow Alice to choose unitary operators from the
Heisenberg-Weyl set of $d_{t}^{2}$ operators that act on the $A^{n}$ share of $|\Phi_{t}\rangle^{A^{n}B^{n}}$. We denote one
of these operators as $V(x_{t},z_{t}) \equiv X(x_{t})Z(z_{t})$ where $x_{t}, z_{t} \in \{0,\ldots,d_{t}-1\}$. If she does this
for every type class subspace and applies a phase $(-1)^{b_{t}}$ in each subspace, then the resulting
unitary operator $U(s)$ acting on all of her $A^{n}$ systems is a direct sum of all of these unitaries:

\begin{equation}
U(s) \equiv \bigoplus_{t} (-1)^{b_{t}}\, V(x_{t},z_{t}),
\tag{20.46}
\end{equation}

where $s$ is a vector containing all of the indices needed to specify the unitary $U(s)$:

\begin{equation}
s \equiv \left((x_{t},z_{t},b_{t})\right)_{t}.
\tag{20.47}
\end{equation}

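Let $\mathcal{S}$ denote the set of all possible vectors $s$. The transpose trick holds for these particular
unitary operators:

\begin{equation}
\left(U(s)^{A^{n}} \otimes I^{B^{n}}\right)|\varphi\rangle^{A^{n}B^{n}} = \left(I^{A^{n}} \otimes (U^{T}(s))^{B^{n}}\right)|\varphi\rangle^{A^{n}B^{n}}
\tag{20.48}
\end{equation}

because it applies in each type class subspace:

\begin{align}
\left(U(s)^{A^{n}} \otimes I^{B^{n}}\right)|\varphi\rangle^{A^{n}B^{n}}
&= \left(\bigoplus_{t} (-1)^{b_{t}} V(x_{t},z_{t})\right)^{A^{n}} \sum_{t'} \sqrt{p(t')}\, |\Phi_{t'}\rangle^{A^{n}B^{n}} \tag{20.49}\\
&= \sum_{t} \sqrt{p(t)}\, (-1)^{b_{t}}\, V(x_{t},z_{t})^{A^{n}} |\Phi_{t}\rangle^{A^{n}B^{n}} \tag{20.50}\\
&= \sum_{t} \sqrt{p(t)}\, (-1)^{b_{t}}\, V^{T}(x_{t},z_{t})^{B^{n}} |\Phi_{t}\rangle^{A^{n}B^{n}} \tag{20.51}\\
&= \left(\bigoplus_{t} (-1)^{b_{t}} V^{T}(x_{t},z_{t})\right)^{B^{n}} \sum_{t'} \sqrt{p(t')}\, |\Phi_{t'}\rangle^{A^{n}B^{n}} \tag{20.52}\\
&= \left(I^{A^{n}} \otimes (U^{T}(s))^{B^{n}}\right)|\varphi\rangle^{A^{n}B^{n}}. \tag{20.53}
\end{align}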
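The block structure behind (20.48) can be illustrated with a toy example. The following is a minimal sketch under stated assumptions: the state is stored as a coefficient matrix $C$ with $|\varphi\rangle = \sum_{a,b} C_{ab}|a\rangle|b\rangle$, so that $(U \otimes I)|\varphi\rangle$ corresponds to $UC$ and $(I \otimes U^{T})|\varphi\rangle$ corresponds to $CU$; the two block sizes stand in for type class dimensions $d_{t}$, and all helper names are illustrative. The trick holds for a direct sum of Heisenberg-Weyl operators, and it can fail for a unitary that mixes blocks with different Schmidt coefficients.

```python
import cmath


def heisenberg_weyl(d, x, z):
    """Matrix of X(x)Z(z), assuming X(x)Z(z)|j> = omega^{zj}|j + x mod d>."""
    omega = cmath.exp(2j * cmath.pi / d)
    U = [[0j] * d for _ in range(d)]
    for j in range(d):
        U[(j + x) % d][j] = omega ** (z * j)
    return U


def direct_sum(blocks):
    """Block-diagonal matrix from pairs (b, V), each block scaled by (-1)^b."""
    n = sum(len(V) for _, V in blocks)
    M = [[0j] * n for _ in range(n)]
    offset = 0
    for b, V in blocks:
        for r, row in enumerate(V):
            for c, val in enumerate(row):
                M[offset + r][offset + c] = (-1) ** b * val
        offset += len(V)
    return M


def state_coefficients(dims, probs):
    """C for sum_t sqrt(p(t))|Phi_t>: diagonal, equal to sqrt(p(t)/d_t) on block t."""
    diag = []
    for d, p in zip(dims, probs):
        diag += [(p / d) ** 0.5] * d
    n = len(diag)
    return [[diag[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transpose_trick_holds(U, C, tol=1e-9):
    """Compare (U ⊗ I)|phi> (matrix U C) with (I ⊗ U^T)|phi> (matrix C U)."""
    left, right = matmul(U, C), matmul(C, U)
    n = len(C)
    return all(abs(left[i][j] - right[i][j]) < tol
               for i in range(n) for j in range(n))
```

With `C = state_coefficients([2, 3], [0.3, 0.7])`, the check succeeds for `direct_sum([(1, heisenberg_weyl(2, 1, 1)), (0, heisenberg_weyl(3, 2, 1))])` and fails for the cyclic shift `heisenberg_weyl(5, 1, 0)`, which does not respect the block structure.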

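Now we need to establish a means by which Alice can select a random code. For every
message $m \in \mathcal{M}$ that Alice would like to transmit, she chooses the elements of the vector
$s \in \mathcal{S}$ uniformly at random, leading to a particular unitary operator $U(s)$. We can write
$s(m)$ instead of just $s$ to denote the explicit association of the vector $s$ with the message $m$;
we can think of each chosen vector $s(m)$ as a classical codeword, with the codebook being
$\{s(m)\}_{m\in\{1,\ldots,|\mathcal{M}|\}}$. This random selection procedure leads to entanglement-assisted quantum
codewords of the following form:

\begin{equation}
|\varphi_{m}\rangle^{A^{n}B^{n}} \equiv \left(U(s(m))^{A^{n}} \otimes I^{B^{n}}\right)|\varphi\rangle^{A^{n}B^{n}}.
\tag{20.54}
\end{equation}

Alice then transmits her systems $A^{n}$ through many uses of the noisy channel, leading to the
following state that is entirely in Bob's control:

\begin{equation}
\mathcal{N}^{A^{n}\to B'^{n}}\left(|\varphi_{m}\rangle\langle\varphi_{m}|^{A^{n}B^{n}}\right).
\tag{20.55}
\end{equation}

Interestingly, the above state is equal to the state in (20.58) below, by exploiting the
transpose trick from (20.48):

\begin{align}
\mathcal{N}^{A^{n}\to B'^{n}}\left(|\varphi_{m}\rangle\langle\varphi_{m}|^{A^{n}B^{n}}\right)
&= \mathcal{N}^{A^{n}\to B'^{n}}\left(U(s(m))^{A^{n}}\, |\varphi\rangle\langle\varphi|^{A^{n}B^{n}}\, U^{\dagger}(s(m))^{A^{n}}\right) \tag{20.56}\\
&= \mathcal{N}^{A^{n}\to B'^{n}}\left(U^{T}(s(m))^{B^{n}}\, |\varphi\rangle\langle\varphi|^{A^{n}B^{n}}\, U^{*}(s(m))^{B^{n}}\right) \tag{20.57}\\
&= U^{T}(s(m))^{B^{n}}\, \mathcal{N}^{A^{n}\to B'^{n}}\left(|\varphi\rangle\langle\varphi|^{A^{n}B^{n}}\right) U^{*}(s(m))^{B^{n}}. \tag{20.58}
\end{align}

Observe that the transpose trick allows us to commute the action of the channel with Alice's
encoding unitary $U(s(m))$. Let $\rho^{B'^{n}B^{n}} \equiv \mathcal{N}^{A^{n}\to B'^{n}}(|\varphi\rangle\langle\varphi|^{A^{n}B^{n}})$ so that

\begin{equation}
\mathcal{N}^{A^{n}\to B'^{n}}\left(|\varphi_{m}\rangle\langle\varphi_{m}|^{A^{n}B^{n}}\right) = U^{T}(s(m))^{B^{n}}\, \rho^{B'^{n}B^{n}}\, U^{*}(s(m))^{B^{n}}.
\tag{20.59}
\end{equation}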



Remark 20.4.1 (Tensor-Power Channel Output States). When using the coding scheme
given above, the reduced state on the channel output (obtained by ignoring Bob's half of the
entanglement in $B^{n}$) is a tensor-power state, regardless of the unitary that Alice applies at
the channel input:

\begin{align}
\mathrm{Tr}_{B^{n}}\left\{\mathcal{N}^{A^{n}\to B'^{n}}\left(|\varphi_{m}\rangle\langle\varphi_{m}|^{A^{n}B^{n}}\right)\right\}
&= \rho^{B'^{n}} \tag{20.60}\\
&= \mathcal{N}^{A^{n}\to B'^{n}}\left(\varphi^{A^{n}}\right), \tag{20.61}
\end{align}

where $\varphi^{A^{n}} = \left(\mathrm{Tr}_{B}\{\varphi^{AB}\}\right)^{\otimes n}$. This follows directly from (20.59) and taking the partial
trace over $B^{n}$. We exploit this feature in the next chapter, where we construct codes for
transmitting both classical and quantum information with the help of shared entanglement.

After Alice has transmitted her entanglement-assisted quantum codewords over the
channel, it becomes Bob's task to determine which message $m$ Alice transmitted, and he should
do so with some POVM $\{\Lambda_{m}\}$ that depends on the random choice of code. Figure 20.3
depicts the protocol.



At this point, we would like to exploit the Packing Lemma from Chapter 15 in order
to establish the existence of a reliable decoding POVM for Bob. Recall that the Packing
Lemma requires four objects, and these four objects should satisfy the four inequalities in
(15.11)-(15.14). The first object required is an ensemble from which Alice and Bob can select
a code randomly, and in our case, the ensemble is

\begin{equation}
\left\{ \frac{1}{|\mathcal{S}|},\ U^{T}(s)^{B^{n}}\, \rho^{B'^{n}B^{n}}\, U^{*}(s)^{B^{n}} \right\}_{s\in\mathcal{S}}.
\tag{20.62}
\end{equation}

The next object required is the expected density operator of this ensemble:

\begin{align}
\overline{\rho}^{B'^{n}B^{n}} &\equiv \mathbb{E}_{S}\left\{U^{T}(S)^{B^{n}}\, \rho^{B'^{n}B^{n}}\, U^{*}(S)^{B^{n}}\right\} \tag{20.63}\\
&= \frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}} U^{T}(s)^{B^{n}}\, \rho^{B'^{n}B^{n}}\, U^{*}(s)^{B^{n}}. \tag{20.64}
\end{align}

Figure 20.3: (a) Alice shares many copies of a pure, bipartite state $|\varphi\rangle^{AB}$ with Bob. She encodes a message
$m$ according to some unitary of the form in (20.46). She transmits her half of the entanglement-assisted
quantum codeword over many uses of the quantum channel, and it is Bob's task to determine which message
she transmits. (b) Alice acting locally with the unitary $U(s(m))$ on her half $A^{n}$ of the entanglement $|\varphi\rangle^{A^{n}B^{n}}$
is the same as her acting nonlocally with $U^{T}(s(m))$ on Bob's half $B^{n}$ of the entanglement. This follows
because of the particular structure of the unitaries in (20.46).



We later prove that this expected density operator has the following simpler form:

\begin{equation}
\overline{\rho}^{B'^{n}B^{n}} = \sum_{t} p(t)\, \mathcal{N}^{A^{n}\to B'^{n}}\left(\pi_{t}^{A^{n}}\right) \otimes \pi_{t}^{B^{n}},
\tag{20.65}
\end{equation}

where $p(t)$ is the distribution from (20.44) and $\pi_{t}$ is the maximally mixed state on a type class
subspace: $\pi_{t} \equiv I_{t}/d_{t}$. The final two objects that we require for the Packing Lemma are the
message subspace projectors and the total subspace projector. We assign these respectively
as

\begin{align}
& U^{T}(s)^{B^{n}}\, \Pi_{\rho,\delta}^{B'^{n}B^{n}}\, U^{*}(s)^{B^{n}}, \tag{20.66}\\
& \Pi_{\rho,\delta}^{B'^{n}} \otimes \Pi_{\rho,\delta}^{B^{n}}, \tag{20.67}
\end{align}

where $\Pi_{\rho,\delta}^{B'^{n}B^{n}}$, $\Pi_{\rho,\delta}^{B'^{n}}$, and $\Pi_{\rho,\delta}^{B^{n}}$ are the typical projectors for many copies of the states
$\rho^{B'B} \equiv \mathcal{N}^{A\to B'}(\varphi^{AB})$, $\rho^{B'} = \mathrm{Tr}_{B}\{\rho^{B'B}\}$, and $\rho^{B} = \mathrm{Tr}_{B'}\{\rho^{B'B}\}$, respectively. Observe that
the size of each message subspace projector is $\approx 2^{nH(B'B)}$, and the size of the total subspace
projector is $\approx 2^{n[H(B')+H(B)]}$. By dimension counting, this suggests that we can pack in
$\approx 2^{n[H(B')+H(B)]}/2^{nH(B'B)} = 2^{nI(B';B)}$ messages with this coding technique.



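If the four conditions of the Packing Lemma are satisfied (see (15.11)-(15.14)), then there
exists a detection POVM that can reliably decode Alice's transmitted messages as long as
the number of messages in the code is not too high. The four conditions in (15.11)-(15.14)
translate to the following four conditions for our case:

\begin{align}
\mathrm{Tr}\left\{\left(\Pi_{\rho,\delta}^{B'^{n}} \otimes \Pi_{\rho,\delta}^{B^{n}}\right)\left(U^{T}(s)^{B^{n}}\, \rho^{B'^{n}B^{n}}\, U^{*}(s)^{B^{n}}\right)\right\} &\geq 1 - 2\epsilon, \tag{20.68}\\
\mathrm{Tr}\left\{\left(U^{T}(s)^{B^{n}}\, \Pi_{\rho,\delta}^{B'^{n}B^{n}}\, U^{*}(s)^{B^{n}}\right)\left(U^{T}(s)^{B^{n}}\, \rho^{B'^{n}B^{n}}\, U^{*}(s)^{B^{n}}\right)\right\} &\geq 1 - \epsilon, \tag{20.69}\\
\mathrm{Tr}\left\{U^{T}(s)^{B^{n}}\, \Pi_{\rho,\delta}^{B'^{n}B^{n}}\, U^{*}(s)^{B^{n}}\right\} &\leq 2^{n[H(B'B)_{\rho} + c\delta]}, \tag{20.70}\\
\left(\Pi_{\rho,\delta}^{B'^{n}} \otimes \Pi_{\rho,\delta}^{B^{n}}\right)\overline{\rho}^{B'^{n}B^{n}}\left(\Pi_{\rho,\delta}^{B'^{n}} \otimes \Pi_{\rho,\delta}^{B^{n}}\right) &\leq 2^{-n[H(B')_{\rho} + H(B)_{\rho} - \eta(n,\delta) - c\delta]}\left(\Pi_{\rho,\delta}^{B'^{n}} \otimes \Pi_{\rho,\delta}^{B^{n}}\right), \tag{20.71}
\end{align}

where $c$ is some positive constant and $\eta(n,\delta)$ is a function that approaches zero as $n \to \infty$
and $\delta \to 0$. The lower bound in (20.68) is written as $1 - 2\epsilon$ to match the bound derived in
(20.81) below.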



We now prove the four inequalities in (20.68)-(20.71), attacking them in the order of
increasing difficulty. The condition in (20.69) holds because

\begin{align}
\mathrm{Tr}\left\{\left(U^{T}(s)^{B^{n}}\, \Pi_{\rho,\delta}^{B'^{n}B^{n}}\, U^{*}(s)^{B^{n}}\right)\left(U^{T}(s)^{B^{n}}\, \rho^{B'^{n}B^{n}}\, U^{*}(s)^{B^{n}}\right)\right\}
&= \mathrm{Tr}\left\{\Pi_{\rho,\delta}^{B'^{n}B^{n}}\, \rho^{B'^{n}B^{n}}\right\} \tag{20.72}\\
&\geq 1 - \epsilon. \tag{20.73}
\end{align}

The equality holds by cyclicity of the trace and because $U^{*}U^{T} = I$. The inequality holds by
exploiting the unit probability property of typical projectors (Property 14.1.1). From this
inequality, observe that we choose each message subspace projector so that it is exactly the one
that should identify the entanglement-assisted quantum codeword $U^{T}(s)^{B^{n}}\, \rho^{B'^{n}B^{n}}\, U^{*}(s)^{B^{n}}$
with high probability.

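We next consider the condition in (20.70):

\begin{align}
\mathrm{Tr}\left\{U^{T}(s)^{B^{n}}\, \Pi_{\rho,\delta}^{B'^{n}B^{n}}\, U^{*}(s)^{B^{n}}\right\}
&= \mathrm{Tr}\left\{\Pi_{\rho,\delta}^{B'^{n}B^{n}}\right\} \tag{20.74}\\
&\leq 2^{n[H(B'B)_{\rho} + c\delta]}. \tag{20.75}
\end{align}

The equality holds again by cyclicity of trace, and the inequality follows from the
exponentially small cardinality property of the typical subspace (Property 14.1.2).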



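Consider the condition in (20.68). First, define $\hat{P} \equiv I - P$. Then

\begin{align}
\Pi_{\rho,\delta}^{B'^{n}} \otimes \Pi_{\rho,\delta}^{B^{n}}
&= \left(I - \hat{\Pi}_{\rho,\delta}^{B'^{n}}\right) \otimes \left(I - \hat{\Pi}_{\rho,\delta}^{B^{n}}\right) \tag{20.76}\\
&= I^{B'^{n}} \otimes I^{B^{n}} - \hat{\Pi}_{\rho,\delta}^{B'^{n}} \otimes I^{B^{n}} - I^{B'^{n}} \otimes \hat{\Pi}_{\rho,\delta}^{B^{n}} + \hat{\Pi}_{\rho,\delta}^{B'^{n}} \otimes \hat{\Pi}_{\rho,\delta}^{B^{n}} \tag{20.77}\\
&\geq I^{B'^{n}} \otimes I^{B^{n}} - \hat{\Pi}_{\rho,\delta}^{B'^{n}} \otimes I^{B^{n}} - I^{B'^{n}} \otimes \hat{\Pi}_{\rho,\delta}^{B^{n}}. \tag{20.78}
\end{align}

Consider the following chain of inequalities:

\begin{align}
& \mathrm{Tr}\left\{\left(\Pi_{\rho,\delta}^{B'^{n}} \otimes \Pi_{\rho,\delta}^{B^{n}}\right)\left(U^{T}(s)^{B^{n}}\, \rho^{B'^{n}B^{n}}\, U^{*}(s)^{B^{n}}\right)\right\} \nonumber\\
&\quad \geq \mathrm{Tr}\left\{U^{T}(s)^{B^{n}}\, \rho^{B'^{n}B^{n}}\, U^{*}(s)^{B^{n}}\right\}
 - \mathrm{Tr}\left\{\left(\hat{\Pi}_{\rho,\delta}^{B'^{n}} \otimes I^{B^{n}}\right)\left(U^{T}(s)^{B^{n}}\, \rho^{B'^{n}B^{n}}\, U^{*}(s)^{B^{n}}\right)\right\} \nonumber\\
&\qquad - \mathrm{Tr}\left\{\left(I^{B'^{n}} \otimes \hat{\Pi}_{\rho,\delta}^{B^{n}}\right)\left(U^{T}(s)^{B^{n}}\, \rho^{B'^{n}B^{n}}\, U^{*}(s)^{B^{n}}\right)\right\} \tag{20.79}\\
&\quad = 1 - \mathrm{Tr}\left\{\hat{\Pi}_{\rho,\delta}^{B'^{n}}\, \rho^{B'^{n}}\right\} - \mathrm{Tr}\left\{\hat{\Pi}_{\rho,\delta}^{B^{n}}\, \rho^{B^{n}}\right\} \tag{20.80}\\
&\quad \geq 1 - 2\epsilon. \tag{20.81}
\end{align}

The first inequality follows from the development in (20.76)-(20.78). The first equality follows
because $\mathrm{Tr}\{U^{T}(s)^{B^{n}}\, \rho^{B'^{n}B^{n}}\, U^{*}(s)^{B^{n}}\} = 1$ and from performing a partial trace on $B^{n}$ and
$B'^{n}$, respectively (while noting that we can apply the transpose trick for the second one).
The final inequality follows from the unit probability property of the typical projectors $\Pi_{\rho,\delta}^{B'^{n}}$
and $\Pi_{\rho,\delta}^{B^{n}}$ (Property 14.1.1).

The last inequality in (20.71) requires the most effort to prove. We first need to prove
that the expected density operator $\overline{\rho}^{B'^{n}B^{n}}$ takes the form given in (20.65). To simplify the
development, we evaluate the expectation without the channel applied, and we then apply
the channel to the state at the end of the development. Consider that

\begin{align}
\overline{\rho}^{A^{n}B^{n}}
&= \frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}} U^{T}(s)^{B^{n}}\, |\varphi\rangle\langle\varphi|^{A^{n}B^{n}}\, U^{*}(s)^{B^{n}} \tag{20.82}\\
&= \frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}} U^{T}(s)^{B^{n}}\left(\sum_{t}\sqrt{p(t)}\, |\Phi_{t}\rangle^{A^{n}B^{n}}\right)\left(\sum_{t'}\langle\Phi_{t'}|^{A^{n}B^{n}}\sqrt{p(t')}\right)U^{*}(s)^{B^{n}} \tag{20.83}\\
&= \frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}}\sum_{t,t'}\sqrt{p(t)p(t')}\, (-1)^{b_{t}(s)+b_{t'}(s)}\left(V^{T}((z_{t},x_{t})(s))\right)^{B^{n}}|\Phi_{t}\rangle\langle\Phi_{t'}|^{A^{n}B^{n}}\left(V^{*}((z_{t'},x_{t'})(s))\right)^{B^{n}}. \tag{20.84}
\end{align}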

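Let us first consider the case when $t = t'$. Then the expression in (20.84) becomes

\begin{align}
& \frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}}\sum_{t} p(t)\left(V^{T}((z_{t},x_{t})(s))\right)^{B^{n}}|\Phi_{t}\rangle\langle\Phi_{t}|^{A^{n}B^{n}}\left(V^{*}((z_{t},x_{t})(s))\right)^{B^{n}} \tag{20.85}\\
&\quad = \sum_{t} p(t)\, \frac{1}{d_{t}^{2}}\sum_{x_{t},z_{t}}\left(V^{T}(z_{t},x_{t})\right)^{B^{n}}|\Phi_{t}\rangle\langle\Phi_{t}|^{A^{n}B^{n}}\left(V^{*}(z_{t},x_{t})\right)^{B^{n}} \tag{20.86}\\
&\quad = \sum_{t} p(t)\, \pi_{t}^{A^{n}} \otimes \pi_{t}^{B^{n}}. \tag{20.87}
\end{align}

These equalities hold because the sum over all the elements in $\mathcal{S}$ implies that we are
uniformly mixing the maximally entangled states $|\Phi_{t}\rangle^{A^{n}B^{n}}$ on the type class subspaces, and
Exercise 4.4.9 gives us that the resulting state on each type class subspace is equal to
$\mathrm{Tr}_{B^{n}}\{\Phi_{t}^{A^{n}B^{n}}\} \otimes \pi_{t}^{B^{n}} = \pi_{t}^{A^{n}} \otimes \pi_{t}^{B^{n}}$. Let us now consider the case when $t \neq t'$. Then
the expression in (20.84) becomes

\begin{align}
& \frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}}\sum_{t',\, t\neq t'}\sqrt{p(t)p(t')}\, (-1)^{b_{t}(s)+b_{t'}(s)}\left(V^{T}((z_{t},x_{t})(s))\right)^{B^{n}}|\Phi_{t}\rangle\langle\Phi_{t'}|^{A^{n}B^{n}}\left(V^{*}((z_{t'},x_{t'})(s))\right)^{B^{n}} \tag{20.88}\\
&\quad = \sum_{t',\, t\neq t'}\frac{1}{4 d_{t}^{2} d_{t'}^{2}}\sum_{b_{t},b_{t'},x_{t},z_{t},x_{t'},z_{t'}}\sqrt{p(t)p(t')}\, (-1)^{b_{t}+b_{t'}}\left(V^{T}(z_{t},x_{t})\right)^{B^{n}}|\Phi_{t}\rangle\langle\Phi_{t'}|^{A^{n}B^{n}}\left(V^{*}(z_{t'},x_{t'})\right)^{B^{n}} \tag{20.89}\\
&\quad = \sum_{t',\, t\neq t'}\frac{1}{4 d_{t}^{2} d_{t'}^{2}}\sum_{b_{t},b_{t'}}(-1)^{b_{t}+b_{t'}}\sum_{x_{t},z_{t},x_{t'},z_{t'}}\sqrt{p(t)p(t')}\left(V^{T}(z_{t},x_{t})\right)^{B^{n}}|\Phi_{t}\rangle\langle\Phi_{t'}|^{A^{n}B^{n}}\left(V^{*}(z_{t'},x_{t'})\right)^{B^{n}} \nonumber\\
&\quad = 0. \tag{20.90}
\end{align}

The last step holds because the phases are chosen independently for distinct types, so that
$\sum_{b_{t},b_{t'}\in\{0,1\}}(-1)^{b_{t}+b_{t'}} = 0$.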
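It then follows that

\begin{equation}
\frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}} U^{T}(s)^{B^{n}}\, |\varphi\rangle\langle\varphi|^{A^{n}B^{n}}\, U^{*}(s)^{B^{n}} = \sum_{t} p(t)\, \pi_{t}^{A^{n}} \otimes \pi_{t}^{B^{n}},
\tag{20.91}
\end{equation}

and by linearity, that

\begin{equation}
\frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}} U^{T}(s)^{B^{n}}\, \mathcal{N}^{A^{n}\to B'^{n}}\left(|\varphi\rangle\langle\varphi|^{A^{n}B^{n}}\right) U^{*}(s)^{B^{n}} = \sum_{t} p(t)\, \mathcal{N}^{A^{n}\to B'^{n}}\left(\pi_{t}^{A^{n}}\right) \otimes \pi_{t}^{B^{n}}.
\tag{20.92}
\end{equation}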



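We now prove the final condition in (20.71) for the Packing Lemma. Consider the
following chain of inequalities:

\begin{align}
& \left(\Pi_{\rho,\delta}^{B'^{n}} \otimes \Pi_{\rho,\delta}^{B^{n}}\right)\overline{\rho}^{B'^{n}B^{n}}\left(\Pi_{\rho,\delta}^{B'^{n}} \otimes \Pi_{\rho,\delta}^{B^{n}}\right) \nonumber\\
&\quad = \left(\Pi_{\rho,\delta}^{B'^{n}} \otimes \Pi_{\rho,\delta}^{B^{n}}\right)\left(\sum_{t} p(t)\, \mathcal{N}^{A^{n}\to B'^{n}}\left(\pi_{t}^{A^{n}}\right) \otimes \pi_{t}^{B^{n}}\right)\left(\Pi_{\rho,\delta}^{B'^{n}} \otimes \Pi_{\rho,\delta}^{B^{n}}\right) \tag{20.93}\\
&\quad = \sum_{t} p(t)\left(\Pi_{\rho,\delta}^{B'^{n}}\, \mathcal{N}^{A^{n}\to B'^{n}}\left(\pi_{t}^{A^{n}}\right)\Pi_{\rho,\delta}^{B'^{n}}\right) \otimes \left(\Pi_{\rho,\delta}^{B^{n}}\, \pi_{t}^{B^{n}}\, \Pi_{\rho,\delta}^{B^{n}}\right) \tag{20.94}\\
&\quad = \sum_{t} p(t)\left(\Pi_{\rho,\delta}^{B'^{n}}\, \mathcal{N}^{A^{n}\to B'^{n}}\left(\pi_{t}^{A^{n}}\right)\Pi_{\rho,\delta}^{B'^{n}}\right) \otimes \left(\Pi_{\rho,\delta}^{B^{n}}\, \frac{\Pi_{t}^{B^{n}}}{\mathrm{Tr}\{\Pi_{t}^{B^{n}}\}}\, \Pi_{\rho,\delta}^{B^{n}}\right) \tag{20.95}\\
&\quad \leq \sum_{t} p(t)\left(\Pi_{\rho,\delta}^{B'^{n}}\, \mathcal{N}^{A^{n}\to B'^{n}}\left(\pi_{t}^{A^{n}}\right)\Pi_{\rho,\delta}^{B'^{n}}\right) \otimes 2^{-n[H(B)_{\rho} - \eta(n,\delta)]}\, \Pi_{\rho,\delta}^{B^{n}}. \tag{20.96}
\end{align}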



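The first equality follows from (20.92). The second equality follows by a simple
manipulation. The third equality follows because the maximally mixed state $\pi_{t}^{B^{n}}$ is equivalent to the
normalized type class projection operator $\Pi_{t}^{B^{n}}$. The inequality follows from Property 14.3.2
and $\Pi_{\rho,\delta}^{B^{n}}\, \Pi_{t}^{B^{n}}\, \Pi_{\rho,\delta}^{B^{n}} \leq \Pi_{\rho,\delta}^{B^{n}}$ (the support of a typical type projector is always in the support of
the typical projector and the intersection of the support of an atypical type with the typical
projector is null). Continuing, by linearity, the last line above is equal to

\begin{align}
&= \Pi_{\rho,\delta}^{B'^{n}}\, \mathcal{N}^{A^{n}\to B'^{n}}\left(\varphi^{A^{n}}\right)\Pi_{\rho,\delta}^{B'^{n}} \otimes 2^{-n[H(B)_{\rho} - \eta(n,\delta)]}\, \Pi_{\rho,\delta}^{B^{n}} \tag{20.97}\\
&\leq 2^{-n[H(B')_{\rho} - c\delta]}\, \Pi_{\rho,\delta}^{B'^{n}} \otimes 2^{-n[H(B)_{\rho} - \eta(n,\delta)]}\, \Pi_{\rho,\delta}^{B^{n}} \tag{20.98}\\
&= 2^{-n[H(B')_{\rho} + H(B)_{\rho} - \eta(n,\delta) - c\delta]}\, \Pi_{\rho,\delta}^{B'^{n}} \otimes \Pi_{\rho,\delta}^{B^{n}}. \tag{20.99}
\end{align}

The first equality follows because $\varphi^{A^{n}} = \sum_{t} p(t)\, \pi_{t}^{A^{n}}$. The inequality follows from the
equipartition property of typical projectors (Property 14.1.3). The final equality follows by
rearranging terms.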



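With the four conditions in (20.68)-(20.71) holding, it follows from Corollary 15.5.1 (the
derandomized version of the Packing Lemma) that there exists a deterministic code and
a POVM $\{\Lambda_{m}^{B'^{n}B^{n}}\}$ that can detect the transmitted states with arbitrarily low maximal
probability of error as long as the size $|\mathcal{M}|$ of the message set is small enough:

\begin{align}
p_{e}^{*} &\equiv \max_{m}\, \mathrm{Tr}\left\{\left(I - \Lambda_{m}^{B'^{n}B^{n}}\right)U^{T}(s(m))^{B^{n}}\, \rho^{B'^{n}B^{n}}\, U^{*}(s(m))^{B^{n}}\right\} \tag{20.100}\\
&\leq 4\left(\epsilon + 2\sqrt{\epsilon}\right) + 8 \cdot 2^{-n[H(B')_{\rho} + H(B)_{\rho} - \eta(n,\delta) - c\delta]}\, 2^{n[H(B'B)_{\rho} + c\delta]}\, |\mathcal{M}| \tag{20.101}\\
&= 4\left(\epsilon + 2\sqrt{\epsilon}\right) + 8 \cdot 2^{-n[I(B';B)_{\rho} - \eta(n,\delta) - 2c\delta]}\, |\mathcal{M}|. \tag{20.102}
\end{align}

We can choose the size of the message set to be $|\mathcal{M}| = 2^{n[I(B';B) - \eta(n,\delta) - 3c\delta]}$ so that the rate
of communication is

\begin{equation}
C \equiv \frac{1}{n}\log_{2}|\mathcal{M}| = I(B';B) - \eta(n,\delta) - 3c\delta,
\tag{20.103}
\end{equation}

and the bound on the maximal probability of error becomes

\begin{equation}
p_{e}^{*} \leq 4\left(\epsilon + 2\sqrt{\epsilon}\right) + 8 \cdot 2^{-nc\delta}.
\tag{20.104}
\end{equation}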
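The exponent arithmetic leading from (20.101) to (20.104) can be spelled out as follows. This is a short worked check that uses only quantities defined above, with the shorthand $I \equiv I(B';B)_{\rho} = H(B')_{\rho} + H(B)_{\rho} - H(B'B)_{\rho}$ and $\eta \equiv \eta(n,\delta)$ introduced here for brevity:

```latex
\begin{align*}
2^{-n[H(B')_{\rho} + H(B)_{\rho} - \eta - c\delta]}\; 2^{n[H(B'B)_{\rho} + c\delta]}
  &= 2^{-n[H(B')_{\rho} + H(B)_{\rho} - H(B'B)_{\rho} - \eta - 2c\delta]}
   = 2^{-n[I - \eta - 2c\delta]}, \\
2^{-n[I - \eta - 2c\delta]}\; |\mathcal{M}|
  &= 2^{-n[I - \eta - 2c\delta]}\; 2^{n[I - \eta - 3c\delta]}
   = 2^{-nc\delta}.
\end{align*}
```

The extra $c\delta$ subtracted in the choice of $|\mathcal{M}|$ is therefore what makes the second term of the error bound decay exponentially in $n$.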

Since $\epsilon$ is an arbitrary positive number that approaches zero for sufficiently large $n$ and $\delta$ is
a positive constant, the maximal probability of error vanishes as $n$ becomes large. Thus, the
quantum mutual information $I(B';B)_{\rho}$, with respect to the state

\begin{equation}
\rho^{B'B} \equiv \mathcal{N}^{A\to B'}\left(\varphi^{AB}\right),
\tag{20.105}
\end{equation}

is an achievable rate for the entanglement-assisted transmission of classical information over
$\mathcal{N}$. To obtain the precise statement in Theorem 20.3.1, we can simply rewrite the quantum
mutual information as $I(A;B)_{\rho}$ with respect to the state

\begin{equation}
\rho^{AB} \equiv \mathcal{N}^{A'\to B}\left(\varphi^{AA'}\right).
\tag{20.106}
\end{equation}

Alice and Bob can achieve the maximum rate of communication simply by determining the
state $\varphi^{AA'}$ that maximizes the quantum mutual information $I(A;B)_{\rho}$ and by generating
entanglement-assisted classical codes from the state $\rho^{AB}$.

20.5 The Converse Theorem 

This section contains the proof of the converse part of the entanglement-assisted classical
capacity theorem. Let us begin by supposing that Alice and Bob are trying to use the
entanglement-assisted channel many times to accomplish the task of common randomness
generation (recall that we took this approach for the converse of the classical capacity
theorem in Section 19.3.2).² An upper bound on the rate at which Alice and Bob can generate
common randomness also serves as an upper bound on the rate at which they can
communicate because a noiseless classical channel can generate common randomness. In such a task,
Alice and Bob share entanglement in some pure state $|\Phi\rangle^{T_{A}T_{B}}$ (though our proof below
applies to any shared state). Alice first prepares the maximally correlated state $\overline{\Phi}^{MM'}$, and the
rate of common randomness in this state is $C \equiv \frac{1}{n}\log|\mathcal{M}|$. Alice then applies some encoding
map $\mathcal{E}^{M'T_{A}\to A'^{n}}$ to the classical system $M'$ and her half $T_{A}$ of the shared entanglement. The
resulting state is

\begin{equation}
\mathcal{E}^{M'T_{A}\to A'^{n}}\left(\overline{\Phi}^{MM'} \otimes \Phi^{T_{A}T_{B}}\right).
\tag{20.107}
\end{equation}

She sends her $A'^{n}$ systems through many uses $\mathcal{N}^{A'^{n}\to B^{n}}$ of the channel $\mathcal{N}^{A'\to B}$, and Bob
receives the systems $B^{n}$, producing the state:

\begin{equation}
\omega^{MT_{B}B^{n}} \equiv \mathcal{N}^{A'^{n}\to B^{n}}\left(\mathcal{E}^{M'T_{A}\to A'^{n}}\left(\overline{\Phi}^{MM'} \otimes \Phi^{T_{A}T_{B}}\right)\right).
\tag{20.108}
\end{equation}

² We should qualify in this approach that we are implicitly assuming a bound on the amount of
entanglement that they consume in this protocol. Otherwise, they could generate an infinite amount of common
randomness. Also, the converse proof outlined here applies equally well if Alice chooses messages from a
uniform random variable and tries to communicate this message to Bob.

Finally, Bob performs some decoding map $\mathcal{D}^{B^{n}T_{B}\to\hat{M}}$ on the above state to give

\begin{equation}
(\omega')^{M\hat{M}} \equiv \mathcal{D}^{B^{n}T_{B}\to\hat{M}}\left(\omega^{MT_{B}B^{n}}\right).
\tag{20.109}
\end{equation}

If the protocol is $\epsilon$-good for common randomness generation, then the actual state $(\omega')^{M\hat{M}}$
resulting from the protocol should be $\epsilon$-close in trace distance to the ideal common randomness
state:

\begin{equation}
\left\| (\omega')^{M\hat{M}} - \overline{\Phi}^{M\hat{M}} \right\|_{1} \leq \epsilon.
\tag{20.110}
\end{equation}



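We now show that the quantum mutual information of the channel serves as an upper
bound on the rate $C$ of any reliable protocol for entanglement-assisted common randomness
generation (a protocol meeting the error criterion in (20.110)). Consider the following chain
of inequalities:

\begin{align}
nC &= I(M;\hat{M})_{\overline{\Phi}} \tag{20.111}\\
&\leq I(M;\hat{M})_{\omega'} + n\epsilon' \tag{20.112}\\
&\leq I(M;B^{n}T_{B})_{\omega} + n\epsilon' \tag{20.113}\\
&= I(T_{B}M;B^{n})_{\omega} + I(M;T_{B})_{\omega} - I(B^{n};T_{B})_{\omega} + n\epsilon' \tag{20.114}\\
&= I(T_{B}M;B^{n})_{\omega} - I(B^{n};T_{B})_{\omega} + n\epsilon' \tag{20.115}\\
&\leq I(T_{B}M;B^{n})_{\omega} + n\epsilon' \tag{20.116}\\
&\leq \max_{\rho^{XAA'^{n}}} I(AX;B^{n})_{\rho} + n\epsilon'. \tag{20.117}
\end{align}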



The first equality follows by evaluating the quantum mutual information of the common
randomness state $\overline{\Phi}^{M\hat{M}}$. The first inequality follows from the assumption that the protocol
satisfies the error criterion in (20.110) and by applying the Alicki-Fannes' inequality from
Exercise 11.9.7 with $\epsilon' \equiv 6\epsilon C + 4H_{2}(\epsilon)/n$. The second inequality follows from quantum data
processing (Corollary 11.9.4): Bob processes the state $\omega$ with the decoder $\mathcal{D}$ to get the state
$\omega'$. The second equality follows from the chain rule for quantum mutual information (see
Exercise 11.7.1). The third equality follows because the systems $M$ and $T_{B}$ are in a product
state, so $I(M;T_{B})_{\omega} = 0$. The third inequality follows because $I(B^{n};T_{B})_{\omega} \geq 0$. Observe that
the state $\omega^{MT_{B}B^{n}}$ is a classical-quantum state of the form:

\begin{equation}
\rho^{XAB^{n}} \equiv \sum_{x} p_{X}(x)\, |x\rangle\langle x|^{X} \otimes \mathcal{N}^{A'^{n}\to B^{n}}\left(\rho_{x}^{AA'^{n}}\right),
\tag{20.118}
\end{equation}

where the classical system $X$ in $\rho^{XAB^{n}}$ plays the role of $M$ in $\omega^{MT_{B}B^{n}}$ and the quantum
system $A$ in $\rho^{XAB^{n}}$ plays the role of $T_{B}$ in $\omega^{MT_{B}B^{n}}$. Then the final inequality follows because
the quantum mutual information $I(T_{B}M;B^{n})_{\omega}$ can never be greater than the maximum of
$I(AX;B^{n})_{\rho}$ over all input states of the form in (20.118).

We can strengthen this converse proof considerably. First, observe that the most general
form of an encoding is an arbitrary CPTP map $\mathcal{E}^{M'T_{A}\to A'^{n}}$ that acts on a classical register $M'$
and a quantum register $T_{A}$. From Section 4.4.8, we know that this map takes the following
form:

\begin{equation}
\mathcal{E}^{M'T_{A}\to A'^{n}}\left(\overline{\Phi}^{MM'} \otimes \Phi^{T_{A}T_{B}}\right) = \frac{1}{|\mathcal{M}|}\sum_{m} |m\rangle\langle m|^{M} \otimes \mathcal{E}_{m}^{T_{A}\to A'^{n}}\left(\Phi^{T_{A}T_{B}}\right),
\tag{20.119}
\end{equation}

where each $\mathcal{E}_{m}^{T_{A}\to A'^{n}}$ is a CPTP map. This particular form follows because the first register
$M'$ on which the map $\mathcal{E}^{M'T_{A}\to A'^{n}}$ acts is a classical register. Now, it would seem strange if
performing a conditional noisy encoding $\mathcal{E}_{m}^{T_{A}\to A'^{n}}$ for each message $m$ could somehow improve
performance. So, we would like to prove that conditional noisy encodings can never
outperform conditional isometric (noiseless) encodings. In this vein, since Alice is in control of the
encoder, we allow her to simulate the noisy encodings $\mathcal{E}_{m}^{T_{A}\to A'^{n}}$ by acting with their isometric
extensions $U_{\mathcal{E}_{m}}^{T_{A}\to A'^{n}E'}$ and tracing out the environments $E'$ (to which she has access). Then
the value of the quantum mutual information $I(T_{B}M;B^{n})_{\omega}$ is unchanged by this simulation.
Now suppose instead that Alice performs a von Neumann measurement of the environment
of the encoding and she places the outcome of the measurement in some classical register
$L$. Then the quantum mutual information can only increase, a result that follows from the
quantum data processing inequality:

\begin{equation}
I(T_{B}LM;B^{n})_{\omega} \geq I(T_{B}M;B^{n})_{\omega}.
\tag{20.120}
\end{equation}

Thus, isometric encodings are sufficient for achieving the entanglement-assisted classical
capacity.

We can view this result in a less operational (and more purely mathematical) way as
well. Consider a state of the form in (20.118). Suppose that each $\rho_{x}^{AA'^{n}}$ has a spectral
decomposition

\begin{equation}
\rho_{x}^{AA'^{n}} = \sum_{y} p_{Y|X}(y|x)\, \psi_{x,y}^{AA'^{n}},
\tag{20.121}
\end{equation}

where the states $\psi_{x,y}^{AA'^{n}}$ are pure. We can define the following augmented state

\begin{equation}
\rho^{XYAB^{n}} \equiv \sum_{x,y} p_{X}(x)\, p_{Y|X}(y|x)\, |x\rangle\langle x|^{X} \otimes |y\rangle\langle y|^{Y} \otimes \mathcal{N}^{A'^{n}\to B^{n}}\left(\psi_{x,y}^{AA'^{n}}\right),
\tag{20.122}
\end{equation}

such that $\rho^{XAB^{n}} = \mathrm{Tr}_{Y}\{\rho^{XYAB^{n}}\}$. Then the quantum data processing inequality implies
that

\begin{equation}
I(AX;B^{n})_{\rho} \leq I(AXY;B^{n})_{\rho}.
\tag{20.123}
\end{equation}

By joining the classical $Y$ register with the classical $X$ register, the following equality
holds

\begin{equation}
\max_{\rho^{XAA'^{n}}} I(AX;B^{n})_{\rho} = \max_{\sigma^{XAA'^{n}}} I(AX;B^{n})_{\sigma},
\tag{20.124}
\end{equation}

where

\begin{equation}
\sigma^{XAB^{n}} \equiv \sum_{x} p_{X}(x)\, |x\rangle\langle x|^{X} \otimes \mathcal{N}^{A'^{n}\to B^{n}}\left(\psi_{x}^{AA'^{n}}\right),
\tag{20.125}
\end{equation}

so that the maximization is over only pure states $\psi_{x}^{AA'^{n}}$. Then we know from the result of
Exercise 12.4.1 that

\begin{equation}
\max_{\sigma^{XAA'^{n}}} I(AX;B^{n})_{\sigma} = \max_{\phi^{AA'^{n}}} I(A;B^{n}),
\tag{20.126}
\end{equation}

where the maximization on the RHS is over pure states $\phi^{AA'^{n}}$. Finally, from additivity of
the quantum mutual information of a quantum channel (Theorem 12.4.1) and an inductive
argument similar to that in Corollary 12.1.1, the following equality holds

\begin{equation}
\max_{\phi^{AA'^{n}}} I(A;B^{n}) = nI(\mathcal{N}).
\tag{20.127}
\end{equation}

Thus, the bound on the classical rate $C$ of a reliable protocol for entanglement-assisted
common randomness generation is

\begin{equation}
nC \leq nI(\mathcal{N}) + n\epsilon',
\tag{20.128}
\end{equation}

and it also serves as an upper bound for entanglement-assisted classical communication. This
demonstrates a single-letter upper bound on the entanglement-assisted classical capacity of
a quantum channel and completes the proof of Theorem 20.3.1.



20.5.1 Feedback Does Not Increase Capacity 

The entanglement-assisted classical capacity formula is the closest formal analogy to
Shannon's capacity formula for a classical channel. The mutual information $I(\mathcal{N})$ of a quantum
channel $\mathcal{N}$ is the optimum of the quantum mutual information over all bipartite input states:

\begin{equation}
I(\mathcal{N}) = \max_{\phi^{AA'}} I(A;B),
\tag{20.129}
\end{equation}

and it is equal to the channel's entanglement-assisted classical capacity by Theorem 20.4.1.
The mutual information $I(p_{Y|X})$ of a classical channel $p_{Y|X}$ is the optimum of the classical
mutual information over all correlated inputs to the channel:

\begin{equation}
I(p_{Y|X}) \equiv \max_{XX'} I(X;Y),
\tag{20.130}
\end{equation}

where $XX'$ are correlated random variables with the distribution $p_{X,X'}(x,x') = p_{X}(x)\delta_{x,x'}$.
The formula is equal to the classical capacity of a classical channel by Shannon's noisy coding
theorem. Both formulas not only appear similar in form, but they also have the important
property of being "single-letter," meaning that each formula, evaluated for a single use of the
channel, is equal to the capacity (this was not the case for the Holevo information from the
previous chapter).
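To make the single-letter character of (20.130) concrete, the classical formula can be evaluated directly for a small channel. The following is a minimal sketch, assuming a binary-input channel given as a table `channel[x][y]` of transition probabilities and a simple grid search over input distributions; the function names and the grid search are illustrative choices and are not taken from the text.

```python
from math import log2


def mutual_information(p_x, channel):
    """I(X;Y) in bits for input distribution p_x and channel[x][y] = p(y|x)."""
    n_out = len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x)))
           for y in range(n_out)]
    total = 0.0
    for x, px in enumerate(p_x):
        for y in range(n_out):
            joint = px * channel[x][y]
            if joint > 0:  # terms with zero joint probability contribute nothing
                total += joint * log2(channel[x][y] / p_y[y])
    return total


def binary_input_capacity(channel, steps=1000):
    """Approximate max over p_X of I(X;Y) by a grid search on Pr{X = 0}."""
    return max(mutual_information([q / steps, 1 - q / steps], channel)
               for q in range(steps + 1))
```

For a binary symmetric channel with flip probability 0.1, given as `[[0.9, 0.1], [0.1, 0.9]]`, the maximum is attained at the uniform input and equals one minus the binary entropy of 0.1.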

We now consider another way in which the entanglement-assisted classical capacity is
the best candidate for being the generalization of Shannon's formula to the quantum world.
Though it might be surprising, it is well known that free access to a classical feedback channel
from receiver to sender does not increase the capacity of a classical channel. We state this
result as the following theorem.

Theorem 20.5.1 (Feedback does not increase classical capacity). The feedback capacity of
a classical channel $p_{Y|X}(y|x)$ is equal to the mutual information of that channel:

\begin{equation}
\sup\{C : C \text{ is achievable with feedback}\} = I(p_{Y|X}),
\tag{20.131}
\end{equation}

where $I(p_{Y|X})$ is defined in (20.130).



Proof. We first define an $(n, C - \delta, \epsilon)$ classical feedback code as one in which every symbol
$x_{i}(m, Y^{i-1})$ of a codeword $x^{n}(m)$ is a function of the message $m \in \mathcal{M}$ and all of the previous
received values $Y_{1}, \ldots, Y_{i-1}$ from the receiver. The decoder consists of the decoding function
$g : \mathcal{Y}^{n} \to \{1, 2, \ldots, |\mathcal{M}|\}$ such that

\begin{equation}
\Pr\{M' \neq M\} \leq \epsilon,
\tag{20.132}
\end{equation}

where $M' \equiv g(Y^{n})$. The lower bound LHS $\geq$ RHS follows because we can always avoid the
use of the feedback channel and achieve the mutual information of the classical channel by
employing Shannon's noisy coding theorem. The upper bound LHS $\leq$ RHS is less obvious,
but it follows from the memoryless structure of the channel and the structure of a feedback
code. Consider the following chain of inequalities:

\begin{align}
nC &= H(M) \tag{20.133}\\
&= I(M;M') + H(M|M') \tag{20.134}\\
&\leq I(M;M') + 1 + \epsilon nC \tag{20.135}\\
&\leq I(M;Y^{n}) + 1 + \epsilon nC. \tag{20.136}
\end{align}

The first equality follows because we assume that the message $M$ is uniformly distributed.
The first inequality follows from Fano's inequality (see Theorem 10.7.3) and the assumption
in (20.132) that the protocol is good up to error $\epsilon$. The last inequality follows from classical
data processing. Continuing, we can bound $I(M;Y^{n})$ from above:

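$$
\begin{aligned}
I(M; Y^n) &= H(Y^n) - H(Y^n | M) && (20.137)\\
&= H(Y^n) - \sum_{k=1}^{n} H(Y_k | Y^{k-1} M) && (20.138)\\
&= H(Y^n) - \sum_{k=1}^{n} H(Y_k | Y^{k-1} M X_k) && (20.139)\\
&= H(Y^n) - \sum_{k=1}^{n} H(Y_k | X_k) && (20.140)\\
&\leq \sum_{k=1}^{n} \left[ H(Y_k) - H(Y_k | X_k) \right] && (20.141)\\
&= \sum_{k=1}^{n} I(X_k; Y_k) && (20.142)\\
&\leq n \max_{p_X(x)} I(X; Y). && (20.143)
\end{aligned}
$$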
The first equality follows from the definition of mutual information. The second equality
follows from the chain rule for entropy (see Exercise 10.3.2). The third equality follows because
$X_k$ is a function of $Y^{k-1}$ and $M$. The fourth equality follows because $Y_k$ is conditionally
independent of $Y^{k-1}$ and $M$ through $X_k$ (that is, $Y^{k-1} M \to X_k \to Y_k$ forms a Markov chain). The
first inequality follows from subadditivity of entropy. The fifth equality follows by definition,
and the final inequality follows because the individual mutual informations in the sum can
never exceed the maximum over all inputs. Putting everything together, our final bound on
the feedback-assisted capacity of a classical channel is

$$C \leq I(p_{Y|X}) + \frac{1}{n} + \epsilon C, \tag{20.144}$$

which becomes $C \leq I(p_{Y|X})$ as $n \to \infty$ and $\epsilon \to 0$. □
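To make the bound in (20.144) concrete, the following is a minimal numerical sketch for a binary symmetric channel. The function names, the grid search over input distributions, and the choice of a flip probability of 0.1 are all illustrative assumptions rather than anything fixed by the text. The sketch evaluates $\max_{p_X(x)} I(X;Y)$ and then rearranges (20.144) as $C \leq (I(p_{Y|X}) + 1/n)/(1 - \epsilon)$, which is valid for $\epsilon < 1$.

```python
import numpy as np

def mutual_information(p_x, channel):
    """I(X;Y) in bits; channel[x, y] = p(y|x)."""
    joint = p_x[:, None] * channel            # p(x, y)
    p_y = joint.sum(axis=0)                   # marginal distribution of Y
    prod = p_x[:, None] * p_y[None, :]        # p(x) p(y)
    mask = joint > 0                          # skip zero-probability terms
    return float(np.sum(joint[mask] * np.log2(joint[mask] / prod[mask])))

def bsc(flip):
    """Binary symmetric channel with flip probability `flip` (illustrative)."""
    return np.array([[1 - flip, flip], [flip, 1 - flip]])

def capacity_binary_input(channel, grid=2001):
    """Grid search for max over p_X of I(X;Y), binary input alphabet only."""
    qs = np.linspace(0.0, 1.0, grid)
    return max(mutual_information(np.array([1 - q, q]), channel) for q in qs)

def converse_rate_bound(info, n, eps):
    """Rearranged (20.144): C <= (I + 1/n) / (1 - eps), assuming eps < 1."""
    return (info + 1.0 / n) / (1.0 - eps)

cap = capacity_binary_input(bsc(0.1))
for n, eps in [(10, 0.1), (1000, 0.01), (10**6, 1e-6)]:
    print(n, eps, converse_rate_bound(cap, n, eps))
```

As $n$ grows and $\epsilon$ shrinks, the printed bound approaches the mutual information of the channel from above, which is the content of the converse.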

Given the above result, we might wonder if a similar result could hold for the entanglement- 
assisted classical capacity. Such a result would more firmly place the entanglement-assisted 
classical capacity as a good generalization of Shannon's coding theorem. Indeed, the follow- 
ing theorem states that this result holds. 

Theorem 20.5.2 (Quantum feedback does not increase the EAC capacity). The classical
capacity of a quantum channel assisted by a quantum feedback channel is equal to that
channel's entanglement-assisted classical capacity:

$$\sup\{C \mid C \text{ is achievable with quantum feedback}\} = I(\mathcal{N}), \tag{20.145}$$

where $I(\mathcal{N})$ is defined in (20.129).



Proof. We define free access to a quantum feedback channel to mean that there is a noiseless
quantum channel of arbitrarily large dimension going from the receiver Bob to the sender
Alice. The bound LHS $\geq$ RHS follows because Bob can use the quantum feedback channel
to establish an arbitrarily large amount of entanglement with Alice. They then just execute
the protocol from Section 20.4 to achieve a rate equal to the entanglement-assisted classical
capacity. The bound LHS $\leq$ RHS is much less obvious, and it requires a proof that is
different from the proof of Theorem 20.5.1. We first need to determine the most general
protocol for classical communication with the assistance of a quantum feedback channel.
Figure 20.4 depicts such a protocol with Alice's systems in red, Bob's systems in blue, and
the feedback systems in green. Alice begins by correlating the message $M$ with quantum
codewords of the form $\rho_m^{A^n}$, where the state $\rho_m^{A^n}$ can be entangled across all of the channel
uses. We describe the $k$th step of the protocol with quantum feedback:

1. Bob receives the channel output $B_k$. He acts with some unitary $U_k$ on $B_k$, his previously
received systems $B^{k-1}$, and two new registers $X_k$ and $Y_k$. We place no restriction on
the size of the registers $X_k$ and $Y_k$.

2. Bob transmits the system $X_k$ through the noiseless feedback channel to Alice.
3. Alice acts with a unitary $V_m^k$ that depends on the message $m$. This unitary acts on her
received system $X_k$, all of the other systems $A_k \cdots A_n$ in her possession, an ancilla $Z_k$,
and all of her previously processed feedback systems $X^{k-1}$.

4. She transmits the system $A_k$ through the noisy quantum channel.

Figure 20.4: Three rounds of the most general protocol for classical communication with a quantum
feedback channel. Alice's systems are in red, Bob's are in blue, and quantum feedback systems are in green.
Alice has a message $M$ classically correlated with the quantum systems $A_1 A_2 A_3$. She acts with a unitary $V_m^1$
on $A_1 A_2 A_3$ and an ancilla $Z_1$, depending on her message $m$. She transmits $A_1$ through the noisy channel.
Bob receives $B_1$ from the channel and performs a unitary $U_1$ on $B_1$ and two other registers $X_1$ and $Y_1$. He
sends $X_1$ through the quantum feedback channel to Alice, and Alice continues the encoding by exploiting
the feedback system. This process continues in the $k$th round with Bob performing unitaries on his previous
systems $B^k$, and Alice performing unitaries on her current register $A_k \cdots A_n$ and her feedback registers $X^{k-1}$.



This protocol is the most general for classical communication with quantum feedback because
all of its operations are inclusive of previous steps in the protocol. Also, the most general
operations could have CPTP maps rather than unitaries, but Alice and Bob can discard
their ancilla systems in order to simulate such maps. We can now proceed with proving
the upper bound LHS $\leq$ RHS. To do so, we assume that the random variable $M$ modeling
Alice's message selection is a uniform random variable, and Bob obtains a random variable
$M'$ by measuring all of his systems $B^n$ at the end of the protocol. For any good protocol for
classical communication, the bound $\Pr\{M' \neq M\} \leq \epsilon$ applies. Consider the following chain
of inequalities (these steps are essentially the same as those in (20.133)-(20.136)):

$$
\begin{aligned}
nC &= H(M) && (20.146)\\
&= I(M; M') + H(M | M') && (20.147)\\
&\leq I(M; M') + 1 + \epsilon n C && (20.148)\\
&\leq I(M; B^n) + 1 + \epsilon n C. && (20.149)
\end{aligned}
$$

This chain of inequalities follows for the same reason as those in (20.133)-(20.136), with the
last step following from quantum data processing. Continuing, we have

$$
\begin{aligned}
I(M; B^n) &= I(M; B_n | B^{n-1}) + I(M; B^{n-1}) && (20.150)\\
&\leq I(M; B_n | B^{n-1}) + I(M; B_{n-1} | B^{n-2}) + I(M; B^{n-2}) && (20.151)\\
&\leq \sum_{k=1}^{n} I(M; B_k | B^{k-1}) && (20.152)\\
&\leq n \max_{\rho} I(M; B | A). && (20.153)
\end{aligned}
$$

The first equality follows from the chain rule for quantum mutual information. The first
inequality follows from quantum data processing and another application of the chain rule.
The second inequality follows from recursively applying the same inequality. The final
inequality follows from considering that the state on systems $M$, $B_k$, and $B^{k-1}$ is a particular
state of the form:

$$\rho^{MBA} \equiv \sum_{m} p_M(m) |m\rangle\langle m|^{M} \otimes \mathcal{N}^{A' \to B}(\rho_m^{A'A}), \tag{20.154}$$

and so $I(M; B_k | B^{k-1})$ can never be greater than the maximization over all states of this
form. We can then bound $\max_{\rho} I(M; B | A)$ by the quantum mutual information $I(\mathcal{N})$ of the
channel:

$$
\begin{aligned}
I(M; B | A) &= H(B | A) - H(B | AM) && (20.155)\\
&\leq H(B) - \sum_{m} p_M(m) H(B | A)_{\rho_m} && (20.156)\\
&= H(B) + \sum_{m} p_M(m) H(B | E)_{\psi_m} && (20.157)\\
&= H(B) + H(B | EM)_{\psi} && (20.158)\\
&\leq H(B) + H(B | E)_{\psi} && (20.159)\\
&= H(B) - H(B | A)_{\phi} && (20.160)\\
&\leq I(\mathcal{N}). && (20.161)
\end{aligned}
$$

The first equality follows from expanding the conditional quantum mutual information.
The first inequality follows from subadditivity of entropy and expanding the conditional
entropy $H(B|AM)$ with the classical variable $M$. The second equality follows by taking $\psi_m^{BAE}$
as a purification of the state $\rho_m^{BA}$ and considering that $-H(B|A) = H(B|E)$ for any tripartite
pure state. The third equality follows by rewriting the convex sum of entropies as a conditional
entropy with the classical system $M$ and the state $\psi^{MBAE} \equiv \sum_m p(m) |m\rangle\langle m|^{M} \otimes \psi_m^{BAE}$.
The second inequality follows because conditioning cannot increase entropy, and the fourth
equality follows by taking $\phi^{ABE}$ as a purification of $\psi^{BE}$. The final inequality follows by
noting that $H(B) - H(B|A) = I(A; B)$ and this quantum mutual information can never be
greater than the maximum. Putting everything together, we get the following upper bound
on any achievable rate $C$ for classical communication with quantum feedback:

$$C \leq I(\mathcal{N}) + \frac{1}{n} + \epsilon C, \tag{20.162}$$

which becomes $C \leq I(\mathcal{N})$ as $n \to \infty$ and $\epsilon \to 0$. □

Corollary 20.5.1. The capacity of a quantum channel $\mathcal{N}$ with unlimited entanglement and
classical feedback is equal to the entanglement-assisted classical capacity of $\mathcal{N}$.

Proof. This result follows because we have that $I(\mathcal{N})$ is a lower bound on this capacity
(simply by avoiding use of the classical feedback channel). Also, $I(\mathcal{N})$ is an upper bound
on this capacity because the entanglement and classical feedback channel can simulate an
arbitrarily large quantum feedback channel via teleportation, and the above theorem gives
an upper bound of $I(\mathcal{N})$ for this setting. □



20.6 Examples of Channels 



This section shows how to compute the entanglement-assisted classical capacity of both the 
quantum erasure channel and the amplitude damping channel, while leaving the capacity 
of the quantum depolarizing channel and the dephasing channel as exercises. For three 
of these channels (erasure, depolarizing, and dephasing), a super-dense-coding-like strategy 
suffices to achieve capacity. This strategy involves Alice locally rotating an ebit shared with 
Bob, sending half of it through the noisy channel, and Bob performing measurements in 
the Bell basis to determine what Alice sent. This process induces a classical channel from 
Alice to Bob, for which its capacity is equal to the entanglement-assisted capacity of the 
original quantum channel (in the case of depolarizing, dephasing, and erasure channels). 
For the amplitude damping channel, this super-dense-coding-like strategy does not achieve 
capacity — in general, it is necessary for Bob to perform a large, collective measurement on 
all of the channel outputs in order for him to determine Alice's message. 



Figure 20.5 plots the entanglement-assisted capacities of these four channels as a function
of their noise parameters. As expected, the depolarizing channel has the worst performance 
because it is a "worst-case scenario" channel — it either sends the state through or replaces 
it with a completely random state. The erasure channel's capacity is just a line of constant 
slope down to zero — this is because the receiver can easily determine the fraction of the time 
that he receives something from the channel. The dephasing channel eventually becomes a 
completely classical channel, for which entanglement cannot increase capacity beyond one 
bit per channel use. Finally, perhaps the most interesting curve is for the amplitude damping 
channel. This channel's capacity is convex when its noise parameter is less than 1/2 and 
concave when it is greater than 1/2. 

20.6.1 The Quantum Erasure Channel 

Recall that the quantum erasure channel acts as follows on an input density operator $\rho^{A'}$:

$$\rho^{A'} \to (1 - \epsilon) \rho^{B} + \epsilon |e\rangle\langle e|^{B}, \tag{20.163}$$

where $\epsilon$ is the erasure probability and $|e\rangle$ is an erasure state that is orthogonal to the
support of the input state $\rho$.

Figure 20.5: The entanglement-assisted classical capacity of the amplitude damping channel, the erasure
channel, the depolarizing channel, and the dephasing channel as a function of each channel's noise parameter.

Proposition 20.6.1. The entanglement-assisted classical capacity of a quantum erasure
channel with erasure probability $\epsilon$ is

$$2 (1 - \epsilon) \log d_A, \tag{20.164}$$

where $d_A$ is the dimension of the input system.



Proof. To determine the entanglement-assisted classical capacity of this channel, we need
to compute its mutual information. So, consider that sending half of a bipartite state $\phi^{AA'}$
through the channel produces the output

$$\sigma^{AB} \equiv (1 - \epsilon) \phi^{AB} + \epsilon \phi^{A} \otimes |e\rangle\langle e|^{B}. \tag{20.165}$$

We could now attempt to calculate and optimize the quantum mutual information $I(A; B)_\sigma$
directly. Observe, though, that Bob can apply the following isometry $U^{B \to BX}$ to his state:

$$U^{B \to BX} \equiv \Pi^{B} \otimes |0\rangle^{X} + |e\rangle\langle e|^{B} \otimes |1\rangle^{X}, \tag{20.166}$$

where $\Pi^{B}$ is a projector onto the support of the input state (for qubits, it would be just
$|0\rangle\langle 0| + |1\rangle\langle 1|$). Applying this isometry leads to a state $\sigma^{ABX}$ where

$$
\begin{aligned}
\sigma^{ABX} &\equiv U^{B \to BX} \sigma^{AB} (U^{B \to BX})^{\dagger} && (20.167)\\
&= (1 - \epsilon) \phi^{AB} \otimes |0\rangle\langle 0|^{X} + \epsilon \phi^{A} \otimes |e\rangle\langle e|^{B} \otimes |1\rangle\langle 1|^{X}. && (20.168)
\end{aligned}
$$

The quantum mutual information $I(A; BX)_\sigma$ is equal to $I(A; B)_\sigma$ because entropies do not
change under the isometry $U^{B \to BX}$. We now calculate $I(A; BX)_\sigma$:

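$$
\begin{aligned}
I(A; BX)_\sigma &= H(A)_\sigma + H(BX)_\sigma - H(ABX)_\sigma && (20.169)\\
&= H(A)_\phi + H(B|X)_\sigma - H(AB|X)_\sigma && (20.170)\\
&= H(A)_\phi + (1 - \epsilon) \left[ H(B)_\phi - H(AB)_\phi \right] + \epsilon \left[ H(B)_{|e\rangle} - H(AB)_{\phi^A \otimes |e\rangle\langle e|} \right] && (20.171)\\
&= H(A)_\phi + (1 - \epsilon) H(B)_\phi - \epsilon \left[ H(A)_\phi + H(B)_{|e\rangle} \right] && (20.172)\\
&= (1 - \epsilon) \left[ H(A)_\phi + H(B)_\phi \right] && (20.173)\\
&= 2 (1 - \epsilon) H(A)_\phi && (20.174)\\
&\leq 2 (1 - \epsilon) \log d_A. && (20.175)
\end{aligned}
$$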



The first equality follows by the definition of quantum mutual information. The second
equality follows from $\phi^{A} = \mathrm{Tr}_{BX}\{\sigma^{ABX}\}$, from the chain rule of entropy, and by canceling
$H(X)$ on both sides. The third equality follows because the $X$ register is a classical register,
indicating whether the erasure occurs. The fourth equality follows because $H(AB)_\phi = 0$,
$H(B)_{|e\rangle} = 0$, and $H(AB)_{\phi^A \otimes |e\rangle\langle e|} = H(A)_\phi + H(B)_{|e\rangle}$. The fifth equality follows again because
$H(B)_{|e\rangle} = 0$ and by collecting terms. The final equality follows because $H(A)_\phi = H(B)_\phi$
($\phi^{AB}$ is a pure bipartite state). The final inequality follows because the entropy of a state
on system $A$ is never greater than the logarithm of the dimension of $A$. We can conclude that
the maximally entangled state $\Phi^{AA'}$ achieves the entanglement-assisted classical capacity of
the quantum erasure channel because $H(A)_\Phi = \log d_A$. □

The strategy for achieving the entanglement-assisted classical capacity of the quantum
erasure channel is straightforward. Alice and Bob simply employ a super-dense coding
strategy on all of the channel uses (this means that Bob performs measurements on each
channel output with his share of the entanglement; there is no need for a large, collective
measurement on all of the channel outputs). For a good fraction $1 - \epsilon$ of the time, this
strategy works and Alice can communicate $2 \log d_A$ bits to Bob. For the other fraction $\epsilon$, all
is lost to the environment. In order for this to work, Alice and Bob need to make use of a
feedback channel from Bob to Alice so that Bob can report which messages come through
and which do not, but Corollary 20.5.1 states that this feedback cannot improve the capacity.
Thus, the rate of communication they can achieve is equal to the capacity $2 (1 - \epsilon) \log d_A$.
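As a numerical sanity check of Proposition 20.6.1, the sketch below builds the output state $\sigma^{AB}$ of (20.165) for a maximally entangled input and evaluates $I(A;B)_\sigma = H(A) + H(B) - H(AB)$ directly. The function names and the matrix embedding (the erasure flag $|e\rangle$ is taken to be the last basis vector of a $(d+1)$-dimensional output space) are illustrative assumptions, not notation from the text.

```python
import numpy as np

def entropy(rho):
    """Von Neumann entropy in bits (tiny eigenvalues are discarded)."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log2(evals)))

def mutual_information(sigma, d_a, d_b):
    """I(A;B) = H(A) + H(B) - H(AB) for a density matrix on A (x) B."""
    t = sigma.reshape(d_a, d_b, d_a, d_b)
    rho_a = np.einsum("ajbj->ab", t)   # trace out B
    rho_b = np.einsum("iaib->ab", t)   # trace out A
    return entropy(rho_a) + entropy(rho_b) - entropy(sigma)

def erasure_output(d, eps):
    """sigma^{AB} of (20.165) for a maximally entangled input state."""
    amp = np.zeros((d, d + 1))
    amp[np.arange(d), np.arange(d)] = 1.0 / np.sqrt(d)   # sum_i |i>|i> / sqrt(d)
    vec = amp.reshape(-1)
    phi_ab = np.outer(vec, vec)
    flag = np.zeros(d + 1)
    flag[d] = 1.0                                         # erasure state |e>
    erased = np.kron(np.eye(d) / d, np.outer(flag, flag)) # phi^A (x) |e><e|
    return (1 - eps) * phi_ab + eps * erased

def erasure_mutual_information(d, eps):
    return mutual_information(erasure_output(d, eps), d, d + 1)

for d, eps in [(2, 0.25), (3, 0.4)]:
    print(d, eps, erasure_mutual_information(d, eps), 2 * (1 - eps) * np.log2(d))
```

The two printed columns should agree, which matches the chain (20.169)-(20.175) with $H(A)_\Phi = \log d_A$.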



20.6.2 The Amplitude Damping Channel 

We now compute the entanglement-assisted classical capacity of the amplitude damping
channel $\mathcal{N}_{\mathrm{AD}}$. Recall that this channel acts as follows on an input qubit in state $\rho$:

$$\mathcal{N}_{\mathrm{AD}}(\rho) = A_0 \rho A_0^{\dagger} + A_1 \rho A_1^{\dagger}, \tag{20.176}$$

where

$$A_0 \equiv |0\rangle\langle 0| + \sqrt{1 - \gamma}\, |1\rangle\langle 1|, \qquad A_1 \equiv \sqrt{\gamma}\, |0\rangle\langle 1|. \tag{20.177}$$

Proposition 20.6.2. The entanglement-assisted classical capacity of an amplitude damping
channel with damping parameter $\gamma$ is

$$I(\mathcal{N}_{\mathrm{AD}}) = \max_{p \in [0,1]} H_2(p) + H_2((1 - \gamma) p) - H_2(\gamma p), \tag{20.178}$$

where $H_2(p) \equiv -p \log p - (1 - p) \log(1 - p)$ is the binary entropy function.



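Proof. Suppose that a matrix representation of the input qubit density operator $\rho$ in the
computational basis is

$$\rho = \begin{bmatrix} 1 - p & \eta^{*} \\ \eta & p \end{bmatrix}. \tag{20.179}$$

One can readily verify that the density operator for Bob has the following matrix
representation:

$$\mathcal{N}_{\mathrm{AD}}(\rho) = \begin{bmatrix} 1 - (1 - \gamma) p & \sqrt{1 - \gamma}\, \eta^{*} \\ \sqrt{1 - \gamma}\, \eta & (1 - \gamma) p \end{bmatrix}, \tag{20.180}$$

and by calculating the elements $\mathrm{Tr}\{A_i \rho A_j^{\dagger}\} |i\rangle\langle j|$, we can obtain a matrix representation for
Eve's density operator:

$$\mathcal{N}_{\mathrm{AD}}^{c}(\rho) = \begin{bmatrix} 1 - \gamma p & \sqrt{\gamma}\, \eta^{*} \\ \sqrt{\gamma}\, \eta & \gamma p \end{bmatrix}, \tag{20.181}$$

where $\mathcal{N}_{\mathrm{AD}}^{c}$ is the complementary channel to Eve. By comparing (20.180) and (20.181), we
can see that the channel to Eve is an amplitude damping channel with damping parameter
$1 - \gamma$. The entanglement-assisted classical capacity of $\mathcal{N}_{\mathrm{AD}}$ is equal to its mutual information:

$$I(\mathcal{N}_{\mathrm{AD}}) = \max_{\phi^{AA'}} I(A; B)_\sigma, \tag{20.182}$$

where $\phi^{AA'}$ is some pure bipartite input state and $\sigma^{AB} = \mathcal{N}_{\mathrm{AD}}^{A' \to B}(\phi^{AA'})$. We need to determine
the input density operator that maximizes the above formula as a function of $\gamma$. As it stands
now, the optimization depends on three parameters: $p$, $\mathrm{Re}\{\eta\}$, and $\mathrm{Im}\{\eta\}$. We can show that
it is sufficient to consider an optimization over only $p$ with $\eta = 0$. The formula in (20.182)
also has the following form:

$$I(\mathcal{N}_{\mathrm{AD}}) = \max_{\rho} \left[ H(\rho) + H(\mathcal{N}_{\mathrm{AD}}(\rho)) - H(\mathcal{N}_{\mathrm{AD}}^{c}(\rho)) \right], \tag{20.183}$$

because

$$
\begin{aligned}
I(A; B)_\sigma &= H(A)_\phi + H(B)_\sigma - H(AB)_\sigma && (20.184)\\
&= H(A')_\phi + H(\mathcal{N}_{\mathrm{AD}}(\rho)) - H(E)_\sigma && (20.185)\\
&= H(\rho) + H(\mathcal{N}_{\mathrm{AD}}(\rho)) - H(\mathcal{N}_{\mathrm{AD}}^{c}(\rho)) && (20.186)\\
&\equiv I_{\mathrm{mut}}(\rho, \mathcal{N}_{\mathrm{AD}}). && (20.187)
\end{aligned}
$$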
The three entropies in (20.183) depend only on the eigenvalues of the three density operators
in (20.179)-(20.181), respectively, which are as follows:

$$
\begin{aligned}
&\tfrac{1}{2} \left( 1 \pm \sqrt{(1 - 2p)^2 + 4 |\eta|^2} \right), && (20.188)\\
&\tfrac{1}{2} \left( 1 \pm \sqrt{(1 - 2(1 - \gamma) p)^2 + 4 |\eta|^2 (1 - \gamma)} \right), && (20.189)\\
&\tfrac{1}{2} \left( 1 \pm \sqrt{(1 - 2 \gamma p)^2 + 4 |\eta|^2 \gamma} \right). && (20.190)
\end{aligned}
$$

The above eigenvalues are in the order of Alice, Bob, and Eve. All of the above eigenvalues
have a similar form, and their dependence on $\eta$ is only through its magnitude. Thus, it
suffices to consider $\eta \in \mathbb{R}$ (this eliminates one parameter). Next, the eigenvalues do not
change if we flip the sign of $\eta$ (this is equivalent to rotating the original state $\rho$ by $Z$, to
$Z \rho Z$), and thus, the mutual information does not change either:

$$I_{\mathrm{mut}}(\rho, \mathcal{N}_{\mathrm{AD}}) = I_{\mathrm{mut}}(Z \rho Z, \mathcal{N}_{\mathrm{AD}}). \tag{20.191}$$

By the above relation and concavity of quantum mutual information in the input density
operator (Theorem 12.4.2), the following inequality holds:

$$
\begin{aligned}
I_{\mathrm{mut}}(\rho, \mathcal{N}_{\mathrm{AD}}) &= \tfrac{1}{2} \left[ I_{\mathrm{mut}}(\rho, \mathcal{N}_{\mathrm{AD}}) + I_{\mathrm{mut}}(Z \rho Z, \mathcal{N}_{\mathrm{AD}}) \right] && (20.192)\\
&\leq I_{\mathrm{mut}}\!\left( \tfrac{1}{2} (\rho + Z \rho Z), \mathcal{N}_{\mathrm{AD}} \right) && (20.193)\\
&= I_{\mathrm{mut}}(\Delta(\rho), \mathcal{N}_{\mathrm{AD}}), && (20.194)
\end{aligned}
$$

where $\Delta$ is a completely dephasing channel in the computational basis. This demonstrates
that it is sufficient to consider diagonal density operators $\rho$ when optimizing the quantum
mutual information. Thus, the eigenvalues in (20.188)-(20.190) respectively become

$$
\begin{aligned}
&\{p, 1 - p\}, && (20.195)\\
&\{(1 - \gamma) p, 1 - (1 - \gamma) p\}, && (20.196)\\
&\{\gamma p, 1 - \gamma p\}, && (20.197)
\end{aligned}
$$

giving our final expression in the statement of the proposition. □

Exercise 20.6.1 Consider the qubit depolarizing channel: $\rho \to (1 - p) \rho + p \pi$. Prove that
its entanglement-assisted classical capacity is equal to

$$2 + (1 - 3p/4) \log(1 - 3p/4) + (3p/4) \log(p/4). \tag{20.198}$$

Exercise 20.6.2 Consider the dephasing channel: $\rho \to (1 - p/2) \rho + (p/2) Z \rho Z$. Prove that
its entanglement-assisted classical capacity is equal to $2 - H_2(p/2)$, where $p$ is the dephasing
parameter.
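The maximization in Proposition 20.6.2 has no simple closed form for general $\gamma$, so it is natural to evaluate it numerically. The sketch below is one hedged way to do so: a plain grid search over $p$ (an illustrative choice, not a method from the text), together with a direct density-matrix computation of $I(A;B)$ for a diagonal input, used only as a cross-check of the formula in (20.178). All function names and the grid size are assumptions made for this example.

```python
import numpy as np

def h2(p):
    """Binary entropy in bits, with H_2(0) = H_2(1) = 0; returns an array."""
    p = np.atleast_1d(np.asarray(p, dtype=float))
    out = np.zeros_like(p)
    m = (p > 0) & (p < 1)
    out[m] = -p[m] * np.log2(p[m]) - (1 - p[m]) * np.log2(1 - p[m])
    return out

def ea_capacity_amplitude_damping(gamma, grid=100001):
    """Grid search for (20.178); returns (capacity estimate, maximizing p)."""
    p = np.linspace(0.0, 1.0, grid)
    vals = h2(p) + h2((1 - gamma) * p) - h2(gamma * p)
    k = int(np.argmax(vals))
    return float(vals[k]), float(p[k])

def entropy(rho):
    """Von Neumann entropy in bits."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log2(evals)))

def mutual_information_direct(gamma, p):
    """I(A;B) for the input sqrt(1-p)|00> + sqrt(p)|11>, via Kraus operators."""
    a0 = np.array([[1.0, 0.0], [0.0, np.sqrt(1 - gamma)]])
    a1 = np.array([[0.0, np.sqrt(gamma)], [0.0, 0.0]])
    vec = np.array([np.sqrt(1 - p), 0.0, 0.0, np.sqrt(p)])
    phi = np.outer(vec, vec)
    # The Kraus operators are real, so the transpose is the adjoint here.
    sigma = sum(np.kron(np.eye(2), k) @ phi @ np.kron(np.eye(2), k).T
                for k in (a0, a1))
    t = sigma.reshape(2, 2, 2, 2)
    rho_a = np.einsum("ajbj->ab", t)
    rho_b = np.einsum("iaib->ab", t)
    return entropy(rho_a) + entropy(rho_b) - entropy(sigma)

for gamma in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(gamma, ea_capacity_amplitude_damping(gamma))
```

At $\gamma = 1/2$ the last two terms of (20.178) cancel, so the maximum is $H_2(1/2) = 1$ bit, which gives a quick check on the output.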
20.7 Concluding Remarks



Shared entanglement has the desirable property of simplifying quantum Shannon theory. The 
entanglement-assisted capacity theorem is one of the strongest known results in quantum 
Shannon theory because it states that the quantum mutual information of a channel is 
equal to its entanglement-assisted capacity. This function of the channel is concave in the 
input state and the set of input states is convex, implying that finding a local maximum is 
equivalent to finding a global one. The converse theorem demonstrates that there is no need 
to take the regularization of the formula — strong subadditivity guarantees that it is additive. 
Furthermore, feedback does not improve this capacity, just as it does not for the classical 
case of Shannon's setting. In these senses, the entanglement-assisted classical capacity is the 
most natural generalization of Shannon's capacity formula to the quantum setting. 

The direct coding part of the capacity theorem exploits a strategy similar to super-dense 
coding — effectively the technique is to perform super-dense coding in the type class subspaces 
of many copies of a shared entangled state. This strategy is equivalent to super-dense coding 
if the initial shared state is a maximally entangled state. The particular protocol that 
we outlined in this chapter has the appealing feature that we can easily make it coherent, 
similar to the way that coherent dense coding is a coherent version of the super-dense coding 
protocol. We take this approach in the next chapter and show that we can produce a whole 
host of other protocols with this technique, eventually leading to a proof of the direct coding 
part of the quantum capacity theorem. 

This chapter features the calculation of the entanglement-assisted classical capacity of 
certain channels of practical interest: the depolarizing channel, the dephasing channel, the 
amplitude damping channel, and the erasure channel. Each one of these channels has a 
single parameter that governs its noisiness, and the capacity in each case is a straightforward 
function of this parameter. One could carry out a similar type of analysis to determine the 
entanglement-assisted capacity of any channel, although it generally will be necessary to 
employ techniques from convex optimization. 

Unfortunately, quantum Shannon theory only gets more complicated from here onward.³
For the other capacity theorems that we will study, such as the private classical capacity 
or the quantum capacity, the best expressions that we have for them are good only up 
to regularization of the formulas. In certain cases, these formulas completely characterize 
the capabilities of the channel for these particular operational tasks, but these formulas 
are not particularly useful in the general case. One important goal for future research in 
quantum Shannon theory would be to improve upon these formulas, in the hopes that we 
could further our understanding of the best strategy for achieving the information processing 
tasks corresponding to these other capacity questions. 



3 We could also view this "unfortunate" situation as being fortunate for conducting open-ended research 
in quantum Shannon theory. 
20.8 History and Further Reading

Adami and Cerf figured that the mutual information of a quantum channel would play an
important role in quantum Shannon theory, and they proved several of its most important
properties [S]. Bennett et al. later demonstrated that the quantum mutual information of a
channel has the operational interpretation as its entanglement-assisted classical capacity [33,
34]. Our proof of the direct part of the entanglement-assisted classical capacity theorem is the
same as that in Ref. [156]. We exploit this approach because it leads to all of the results in the
next chapter, implying that this protocol is sufficient to generate all of the known protocols in
quantum Shannon theory (with the exception of private classical communication). Giovannetti
and Fazio determined several capacities of the amplitude damping channel [102], and Perez-
Garcia and Wolf made some further observations regarding it [262]. Bowen et al. proved
that the classical capacity of a channel assisted by unbounded quantum feedback is equal to
its entanglement-assisted classical capacity [42].
CHAPTER 21



Coherent Communication with Noisy 

Resources 



This chapter demonstrates the power of both coherent communication from Chapter 7 and
the particular protocol for entanglement-assisted classical coding from the previous chapter.
Recall that coherent dense coding is a version of the dense coding protocol in which the
sender and receiver perform all of its steps coherently.¹ Since our protocol for entanglement-
assisted classical coding from the previous chapter is really just a glorified dense coding
protocol, the sender and receiver can perform each of its steps coherently, generating a
protocol for entanglement-assisted coherent coding. Then, by exploiting the fact that two
coherent bits are equivalent to a qubit and an ebit, we obtain a protocol for entanglement-
assisted quantum coding that consumes far less entanglement than a naive strategy would
in order to accomplish this task. We next combine this entanglement-assisted quantum
coding protocol with entanglement distribution (Section 6.2.1) and obtain a protocol for
which the channel's coherent information (Section 12.5) is an achievable rate for quantum
communication. This sequence of steps demonstrates an alternate proof of the direct part
of the quantum channel coding theorem in Chapter 23.

Entanglement-assisted classical communication is one generalization of super-dense coding,
in which the noiseless qubit channel becomes an arbitrary noisy quantum channel while
the noiseless ebits remain noiseless. Another generalization of super-dense coding is a
protocol named noisy super-dense coding, in which the shared entanglement becomes a shared
noisy state $\rho^{AB}$ and the noiseless qubit channels remain noiseless. Interestingly, the protocol
that we employ in this chapter for noisy super-dense coding is essentially equivalent to
the protocol from the previous chapter for entanglement-assisted classical communication,
with some slight modifications to account for the different setting. We can also construct a
coherent version of noisy super-dense coding, leading to a protocol that we name coherent
state transfer. Coherent state transfer accomplishes not only the task of generating coherent
communication between Alice and Bob, but it also allows Alice to transfer her share of the
state $\rho^{AB}$ to Bob. By combining coherent state transfer with both the coherent communication
identity and teleportation, we obtain protocols for quantum-assisted state transfer and
classical-assisted state transfer, respectively. The latter protocol gives an operational
interpretation to the quantum conditional entropy $H(A|B)_\rho$: if it is positive, then the protocol
consumes entanglement at the rate $H(A|B)_\rho$, and if it is negative, the protocol generates
entanglement at the rate $|H(A|B)_\rho|$.

1 Performing a protocol coherently means that we replace conditional unitaries with controlled unitaries
and measurements with controlled gates (e.g., see Figures 6.2 and 7.3).

The final part of this chapter shows that our particular protocol for entanglement-assisted
classical communication is even more powerful than suggested in the first paragraph. It
allows for a sender to communicate both coherent bits and incoherent classical bits to a
receiver, and they can trade off these two resources against one another. The structure of
the entanglement-assisted protocol allows for this possibility, by taking advantage of
Remark 20.4.1 and by combining it with the HSW classical communication protocol from
Chapter 19. Then, by exploiting the coherent communication identity, we obtain a protocol
for entanglement-assisted communication of classical and quantum information. Chapter 24
demonstrates that this protocol, teleportation, super-dense coding, and entanglement
distribution are sufficient to accomplish any task in dynamic quantum Shannon theory involving
the three unit resources of classical bits, qubits, and ebits. These four protocols give a
three-dimensional achievable rate region that is the best known characterization for any
information processing task that a sender and receiver would like to accomplish with a noisy
channel and the three unit resources. Chapter 24 discusses this triple trade-off scenario in
full detail.



21.1 Entanglement-Assisted Quantum Communication

The entanglement-assisted classical capacity theorem states that the quantum mutual
information of a channel is equal to its capacity for transmitting classical information with the
help of shared entanglement, and the direct coding theorem from Section 20.4 provides a
protocol that achieves the capacity. We were not much concerned with the rate at which this
protocol consumes entanglement, but a direct calculation reveals that it consumes $H(A)_\varphi$
ebits per channel use, where $|\varphi\rangle^{AB}$ is the bipartite state that they share before the protocol
begins.²

2 This result follows because they can concentrate $n$ copies of the state $|\varphi\rangle^{AB}$ to $nH(A)$ ebits, as we
learned in Chapter 18. Also, they can "dilute" $nH(A)$ ebits to $n$ copies of $|\varphi\rangle^{AB}$ with the help of a sublinear
amount of classical communication that does not factor into the resource count (we have not studied the
protocol for entanglement dilution).

Suppose now that Alice is interested in exploiting the channel and shared entanglement
in order to transmit quantum information to Bob. There is a simple (and as we will see,
naive) way that we can convert the protocol in Section 20.4 to one that transmits quantum
information: they can just combine it with teleportation. This naive strategy requires
consuming ebits at an additional rate of $\frac{1}{2} I(A; B)_\rho$ in order to have enough entanglement to
combine with teleportation, where $\rho^{AB} \equiv \mathcal{N}^{A' \to B}(\varphi^{AA'})$. To see this, consider the following
resource inequalities:

$$
\begin{aligned}
\langle \mathcal{N} \rangle + \left( H(A)_\rho + \tfrac{1}{2} I(A; B)_\rho \right) [qq] &\geq I(A; B)_\rho\, [c \to c] + \tfrac{1}{2} I(A; B)_\rho\, [qq] && (21.1)\\
&\geq \tfrac{1}{2} I(A; B)_\rho\, [q \to q]. && (21.2)
\end{aligned}
$$

The first inequality follows by having them exploit the channel and the $nH(A)_\rho$ ebits to
generate classical communication at a rate $I(A; B)_\rho$ (while doing nothing with the extra
$n \frac{1}{2} I(A; B)_\rho$ ebits). Alice then exploits the ebits and the classical communication in a
teleportation protocol to send $n \frac{1}{2} I(A; B)_\rho$ qubits to Bob. This rate of quantum communication
is provably optimal; were it not so, it would be possible to combine the protocol in (21.1)-
(21.2) with super-dense coding and beat the optimal rate for classical communication given
by the entanglement-assisted classical capacity theorem.

Although the above protocol achieves the entanglement-assisted quantum capacity, we
are left thinking that the entanglement consumption rate of $H(A)_\rho + \frac{1}{2} I(A; B)_\rho$ ebits per
channel use might be a bit more than necessary because teleportation and super-dense coding
are not dual under resource reversal. That is, if we combine the protocol with super-dense
coding and teleportation ad infinitum, then it consumes an infinite amount of entanglement.
In practice, this "back and forth" with teleportation and super-dense coding would be a poor
way to consume the precious resource of entanglement.

How might we make more judicious use of shared entanglement? Recall that coherent
communication from Chapter 7 was helpful for doing so, at least in the noiseless case. A
sender and receiver can combine coherent teleportation and coherent dense coding ad
infinitum without any net loss in entanglement, essentially because these two protocols are dual
under resource reversal. The following theorem shows how we can upgrade the protocol in
Section 20.4 to one that generates coherent communication instead of just classical
communication. The resulting protocol is one way to have a version of coherent dense coding in
which one noiseless resource is replaced by a noisy one.
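The contrast between the incoherent and coherent "back and forth" can be illustrated with a simple resource ledger. The sketch below is bookkeeping only, not a simulation of the protocols. The resource counts assumed for each noiseless protocol are the standard ones recalled from the earlier chapters (they are not restated in this section): super-dense coding turns one qubit channel and one ebit into two classical bits, teleportation turns two classical bits and one ebit into one qubit channel, coherent dense coding turns one qubit channel and one ebit into two cobits, and coherent teleportation turns two cobits and one ebit into one qubit channel plus two ebits. The dictionary names are hypothetical.

```python
from collections import Counter

# Net change per protocol run (negative = consumed, positive = generated).
# These counts are assumptions recalled from the noiseless unit protocols.
SUPER_DENSE = {"qubit": -1, "ebit": -1, "cbit": +2}
TELEPORT = {"cbit": -2, "ebit": -1, "qubit": +1}
COHERENT_DENSE = {"qubit": -1, "ebit": -1, "cobit": +2}
COHERENT_TELEPORT = {"cobit": -2, "ebit": +1, "qubit": +1}  # uses 1 ebit, returns 2

def run_cycles(protocols, cycles):
    """Apply the listed protocols in order, `cycles` times; return net ledger."""
    ledger = Counter()
    for _ in range(cycles):
        for protocol in protocols:
            ledger.update(protocol)   # Counter.update adds signed counts
    return dict(ledger)

print(run_cycles([SUPER_DENSE, TELEPORT], 5))
print(run_cycles([COHERENT_DENSE, COHERENT_TELEPORT], 5))
```

Under these assumed counts, every incoherent round trip burns two ebits, while the coherent round trip leaves the entanglement balance at zero, which is what "dual under resource reversal" means operationally.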

Theorem 21.1.1 (Entanglement-Assisted Coherent Communication). The following
resource inequality corresponds to an achievable protocol for entanglement-assisted coherent
communication over a noisy quantum channel:

$$\langle \mathcal{N} \rangle + H(A)_\rho\, [qq] \geq I(A; B)_\rho\, [q \to qq], \tag{21.3}$$

where $\rho^{AB} \equiv \mathcal{N}^{A' \to B}(\varphi^{AA'})$.

Proof. Suppose that Alice and Bob share many copies of some pure, bipartite entangled
state $|\varphi\rangle^{AB}$. Consider the code from the direct coding theorem in Section 20.4. We can
say that it is a set of $D^2 \approx 2^{nI(A;B)_\rho}$ unitaries $U(s(m))$, from which Alice can select, and
she applies a particular unitary $U(s(m))$ to her share $A^n$ of the entanglement in order to
encode message $m$. Also, Bob has a detection POVM $\{\Lambda_m^{B'^n B^n}\}$ acting on his share of the
entanglement and the channel outputs that he can exploit to detect message $m$. Just as we
were able to construct a coherent super-dense coding protocol in Chapter 7 by performing
all the steps in dense coding coherently, we can do so for the entanglement-assisted classical
coding protocol in Section 20.4. We track the steps in such a protocol. Suppose Alice shares
a state with a reference system $R$ to which she does not have access:

$$|\psi\rangle^{RA_1} \equiv \sum_{l,m=1}^{D^2} \alpha_{l,m} |l\rangle^{R} |m\rangle^{A_1}, \tag{21.4}$$

where $\{|l\rangle\}$ and $\{|m\rangle\}$ are some orthonormal bases for $R$ and $A_1$, respectively. We say
that Alice and Bob have implemented a coherent channel if they execute the map
$|m\rangle^{A_1} \to |m\rangle^{A_1} |m\rangle^{B_1}$, which transforms the above state to

$$\sum_{l,m=1}^{D^2} \alpha_{l,m} |l\rangle^{R} |m\rangle^{A_1} |m\rangle^{B_1}. \tag{21.5}$$
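To make the map in (21.5) concrete, the following Python/NumPy sketch builds the coherent
channel isometry for a small dimension and applies it to a random state of the form in (21.4).
The dimension `dim` (standing in for $D^2$), the random amplitudes `alpha`, and all variable
names are illustrative assumptions and are not part of the protocol in the text.

```python
import numpy as np

dim = 3  # stands in for D^2; chosen small for illustration (assumption)

# Coherent channel isometry V: |m> -> |m>|m>, written as a (dim*dim) x dim matrix.
V = np.zeros((dim * dim, dim))
for m in range(dim):
    V[m * dim + m, m] = 1.0

# Random amplitudes alpha[l, m] for a normalized state on R and A1, as in (21.4).
rng = np.random.default_rng(1)
alpha = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
alpha /= np.linalg.norm(alpha)
psi = alpha.reshape(-1)  # flattened with index order (l, m)

# Apply the isometry to A1 only; the reference R is left untouched.
out = (np.kron(np.eye(dim), V) @ psi).reshape(dim, dim, dim)

# Expected amplitudes as in (21.5): alpha[l, m] when A1 and B1 hold the same m.
expected = np.einsum("lm,mk->lmk", alpha, np.eye(dim))
is_isometry = np.allclose(V.T @ V, np.eye(dim))
matches = np.allclose(out, expected)
print(is_isometry, matches)
```

The check confirms that the amplitudes $\alpha_{l,m}$ are copied coherently rather than measured:
the output is still a superposition over $m$, now shared between $A_1$ and $B_1$.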



We say that they have implemented a coherent channel approximately if the state resulting
from the protocol is $\epsilon$-close in trace distance to the above state. If we can show that $\epsilon$ is an
arbitrary positive number that approaches zero in the asymptotic limit, then the simulation
of an approximate coherent channel asymptotically becomes an exact simulation. Alice's
first step is to append her shares of the entangled state $|\varphi\rangle^{A^n B^n}$ to $|\psi\rangle^{RA_1}$ and apply the
following controlled unitary from her system $A_1$ to her system $A^n$:

$$\sum_m |m\rangle\langle m|^{A_1} \otimes U(s(m))^{A^n}. \tag{21.6}$$

The resulting global state is as follows:

$$\sum_{l,m} \alpha_{l,m} |l\rangle^{R} |m\rangle^{A_1} U(s(m))^{A^n} |\varphi\rangle^{A^n B^n}. \tag{21.7}$$



By the structure of the unitaries $U(s(m))$ (see (20.46) and (20.48)), the above state is
equivalent to the following one:

$$\sum_{l,m} \alpha_{l,m} |l\rangle^{R} |m\rangle^{A_1} \left(U^T(s(m))\right)^{B^n} |\varphi\rangle^{A^n B^n}. \tag{21.8}$$
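The step from (21.7) to (21.8) moves a unitary from Alice's side of the entangled state to
Bob's side as its transpose. The following Python/NumPy sketch checks this in the simplest
setting, a maximally entangled state and an arbitrary unitary. This is a simplifying
assumption: in the text the state $|\varphi\rangle^{A^n B^n}$ is not maximally entangled in general, and the
property relies on the specific structure of $U(s(m))$ in (20.46) and (20.48). The dimension
`d` and the random unitary are hypothetical choices.

```python
import numpy as np

d = 4  # local dimension, chosen arbitrarily for illustration
rng = np.random.default_rng(2)

# Random unitary from the QR decomposition of a complex Gaussian matrix.
z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
U, _ = np.linalg.qr(z)

# Maximally entangled state sum_a |a>|a> / sqrt(d), flattened with index order (a, b).
phi = np.eye(d).reshape(-1) / np.sqrt(d)

alice_side = np.kron(U, np.eye(d)) @ phi   # U acting on Alice's system
bob_side = np.kron(np.eye(d), U.T) @ phi   # U^T acting on Bob's system
same_state = np.allclose(alice_side, bob_side)

# A correction U^* on Bob's side undoes U^T, since U^* U^T = (U U^dagger)^* = I.
undone = np.allclose(U.conj() @ U.T, np.eye(d))
print(same_state, undone)
```

The second check is the reason a controlled $U^*$ on Bob's side can later remove the
dependence on $m$ from the shared state.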



Interestingly, observe that Alice applying the controlled gate in (21.6) is the same as her
applying the nonlocal controlled gate $\sum_m |m\rangle\langle m|^{A_1} \otimes (U^T(s(m)))^{B^n}$, due to the nonlocal (and
perhaps spooky!) properties of the entangled state $|\varphi\rangle^{A^n B^n}$. Alice then sends her systems
$A^n$ through many uses of the noisy quantum channel $\mathcal{N}^{A \to B'}$, whose isometric extension is
$U_{\mathcal{N}}^{A \to B'E}$. Let $|\varphi\rangle^{B'^n E^n B^n}$ denote the state resulting from the isometric extension
$U_{\mathcal{N}}^{A^n \to B'^n E^n}$ of the channel acting on the state $|\varphi\rangle^{A^n B^n}$:

$$|\varphi\rangle^{B'^n E^n B^n} \equiv U_{\mathcal{N}}^{A^n \to B'^n E^n} |\varphi\rangle^{A^n B^n}. \tag{21.9}$$

After Alice transmits through the channel, the state becomes

$$\sum_{l,m} \alpha_{l,m} |l\rangle^{R} |m\rangle^{A_1} \left(U^T(s(m))\right)^{B^n} |\varphi\rangle^{B'^n E^n B^n}, \tag{21.10}$$

where Bob now holds his shares $B^n$ of the entanglement and the channel outputs $B'^n$.
(Observe that the action of the controlled unitary in (21.6) commutes with the action of the
channel.) Rather than perform an incoherent measurement with the POVM $\{\Lambda_m^{B'^n B^n}\}$, Bob
applies a coherent gentle measurement (see Section 5.4), an isometry of the following form:

$$\sum_m \sqrt{\Lambda_m}^{B'^n B^n} \otimes |m\rangle^{B_1}. \tag{21.11}$$



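Using the result of Exercise 5.4.1, we can readily check that the resulting state is $2\sqrt{\epsilon}$-close
in trace distance to the following state:

$$\sum_{l,m} \alpha_{l,m} |l\rangle^{R} |m\rangle^{A_1} \left(U^T(s(m))\right)^{B^n} |\varphi\rangle^{B'^n E^n B^n} |m\rangle^{B_1}. \tag{21.12}$$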



Thus, for the rest of the protocol, we pretend as if they are acting on the above state. Alice
and Bob would like to coherently remove the coupling of their index $m$ to the environment,
so Bob performs the following controlled unitary:

$$\sum_m |m\rangle\langle m|^{B_1} \otimes \left(U^*(s(m))\right)^{B^n}, \tag{21.13}$$

and the final state is

$$\sum_{l,m=1}^{D^2} \alpha_{l,m} |l\rangle^{R} |m\rangle^{A_1} |\varphi\rangle^{B'^n E^n B^n} |m\rangle^{B_1}
= \left( \sum_{l,m=1}^{D^2} \alpha_{l,m} |l\rangle^{R} |m\rangle^{A_1} |m\rangle^{B_1} \right) \otimes |\varphi\rangle^{B'^n E^n B^n}. \tag{21.14}$$

Thus, this protocol implements a $D^2$-dimensional coherent channel up to an arbitrarily small
error, and we have shown that the resource inequality in the statement of the theorem holds.
Figure 21.1 depicts the entanglement-assisted coherent coding protocol. □



It is now a straightforward task to convert the protocol from Theorem 21.1.1 into one for
entanglement-assisted quantum communication, by exploiting the coherent communication
identity from Section 7.5.

Corollary 21.1.1 (Entanglement-Assisted Quantum Communication). The following resource
inequality corresponds to an achievable protocol for entanglement-assisted quantum
communication over a noisy quantum channel:

$$\langle \mathcal{N} \rangle + \frac{1}{2} I(A;E)_\varphi [qq] \geq \frac{1}{2} I(A;B)_\varphi [q \to q], \tag{21.15}$$

where $|\varphi\rangle^{ABE} \equiv U_{\mathcal{N}}^{A' \to BE} |\varphi\rangle^{AA'}$ and $U_{\mathcal{N}}^{A' \to BE}$ is an isometric extension of the
channel $\mathcal{N}^{A' \to B}$.

Figure 21.1: The protocol for entanglement-assisted coherent communication. Observe that it is the
coherent version of the protocol for entanglement-assisted classical communication, just as coherent dense
coding is the coherent version of super-dense coding (compare this figure and Figure 20.3 with Figures 6.2
and 7.3). Instead of applying conditional unitaries, Alice applies a controlled unitary from her system $A_1$ to
her share of the entanglement and sends the encoded state through many uses of the noisy channel. Rather
than performing a POVM, Bob performs a coherent gentle measurement from his systems $B'^n$ and $B^n$ to
an ancilla $B_1$. Finally, he applies a similar controlled unitary in order to decouple the environment from the
state of his ancilla $B_1$.



Consider the coherent communication identity from Section 7.5. This identity states
that a $D^2$-dimensional coherent channel can perfectly simulate a $D$-dimensional quantum
channel and a maximally entangled state $|\Phi\rangle^{AB}$ with Schmidt rank $D$. In terms of cobits,
qubits, and ebits, the coherent communication identity is the following resource equality for
$D$-dimensional systems:

$$2 \log D \, [q \to qq] = \log D \, [q \to q] + \log D \, [qq]. \tag{21.16}$$

Consider the following chain of resource inequalities:

$$\langle \mathcal{N} \rangle + H(A)_\varphi [qq] \geq I(A;B)_\varphi [q \to qq] \tag{21.17}$$
$$\geq \frac{1}{2} I(A;B)_\varphi [q \to q] + \frac{1}{2} I(A;B)_\varphi [qq]. \tag{21.18}$$



The first resource inequality is the statement of Theorem 21.1.1, and the second resource
inequality follows from an application of coherent teleportation. If we then allow for
catalytic protocols, in which we allow for some use of a resource with the demand that it be
returned at the end of the protocol, we have a protocol for entanglement-assisted quantum
communication:

$$\langle \mathcal{N} \rangle + \frac{1}{2} I(A;E)_\varphi [qq] \geq \frac{1}{2} I(A;B)_\varphi [q \to q], \tag{21.19}$$

because $H(A)_\varphi - \frac{1}{2} I(A;B)_\varphi = \frac{1}{2} I(A;E)_\varphi$ (see Exercise 11.6.6).
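The entropy identity used in the last step holds for any pure state on $A$, $B$, and $E$. As a
numerical sanity check (not a proof), the following Python/NumPy sketch draws a random pure
tripartite state and compares both sides. The dimensions, the random seed, and the helper
names `entropy` and `reduced` are assumptions made for illustration.

```python
import numpy as np

def entropy(rho):
    """Von Neumann entropy in bits."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log2(evals)))

def reduced(psi, keep):
    """Reduced density matrix of the pure-state tensor psi on the axes in `keep`."""
    rest = [i for i in range(psi.ndim) if i not in keep]
    m = np.transpose(psi, list(keep) + rest)
    m = m.reshape(int(np.prod([psi.shape[i] for i in keep])), -1)
    return m @ m.conj().T

# Random pure state with axes ordered (A, B, E); dimensions are arbitrary.
rng = np.random.default_rng(0)
dims = (2, 3, 4)
psi = rng.normal(size=dims) + 1j * rng.normal(size=dims)
psi /= np.linalg.norm(psi)

H = lambda keep: entropy(reduced(psi, keep))
I_AB = H([0]) + H([1]) - H([0, 1])
I_AE = H([0]) + H([2]) - H([0, 2])

ebits_needed = H([0]) - 0.5 * I_AB   # net entanglement after cancelation
half_I_AE = 0.5 * I_AE               # rate appearing in (21.19)
print(ebits_needed, half_I_AE)
```

The two printed numbers agree up to floating-point error, reflecting the fact that
$I(A;B) + I(A;E) = 2H(A)$ for a pure state on $ABE$.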



When comparing the entanglement consumption rate of the naive protocol in (21.1-21.2)
with that of the protocol in Theorem 21.1.1, we see that the former requires an additional
$I(A;B)_\rho$ ebits per channel use. Also, Theorem 21.1.1 leads to a simple proof of the
achievability part of the quantum capacity theorem, as we see in the next section.

Exercise 21.1.1 Suppose that Alice can obtain the environment $E$ of the channel $U_{\mathcal{N}}^{A' \to BE}$.
Such a channel is known as a coherent feedback isometry. Show how they can achieve the
following resource inequality with the coherent feedback isometry $U_{\mathcal{N}}^{A' \to BE}$:

$$\langle U_{\mathcal{N}}^{A' \to BE} \rangle \geq \frac{1}{2} I(A;B)_\varphi [q \to q] + \frac{1}{2} I(E;B)_\varphi [qq], \tag{21.20}$$

where $|\varphi\rangle^{ABE} = U_{\mathcal{N}}^{A' \to BE} |\varphi\rangle^{AA'}$ and $\rho^{A'} = \mathrm{Tr}_A\{\varphi^{AA'}\}$. This protocol is a generalization of
coherent teleportation from Section 7.4 because it reduces to coherent teleportation in the
case that $U_{\mathcal{N}}^{A' \to BE}$ is equivalent to two coherent channels.



21.2 Quantum Communication 



We can obtain a protocol for quantum communication simply by combining the protocol from
Theorem 21.1.1 further with entanglement distribution. The resulting protocol again makes
catalytic use of entanglement, in the sense that it exploits some amount of entanglement
shared between Alice and Bob at the beginning of the protocol, but it generates the same
amount of entanglement at the end, so that the net entanglement consumption rate of the
protocol is zero. The resulting rate of quantum communication turns out to be the same as
we find for the quantum channel coding theorem in Chapter 23 (though the protocol given
there does not make catalytic use of shared entanglement).

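Corollary 21.2.1 (Quantum Communication). The coherent information $Q(\mathcal{N})$ is an
achievable rate for quantum communication over a quantum channel $\mathcal{N}$. That is, the following
resource inequality holds:

$$\langle \mathcal{N} \rangle \geq Q(\mathcal{N}) [q \to q], \tag{21.21}$$

where $Q(\mathcal{N}) \equiv \max_\varphi I(A \rangle B)_\rho$ and $\rho^{AB} \equiv \mathcal{N}^{A' \to B}(\varphi^{AA'})$.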

Proof. If we further combine the entanglement-assisted quantum communication protocol
from Theorem 21.1.1 with entanglement distribution at a rate $\frac{1}{2} I(A;E)_\rho$, we obtain the
following resource inequalities:

$$\langle \mathcal{N} \rangle + \frac{1}{2} I(A;E)_\rho [qq]
\geq \frac{1}{2} \left[ I(A;B)_\rho - I(A;E)_\rho \right] [q \to q] + \frac{1}{2} I(A;E)_\rho [q \to q] \tag{21.22}$$
$$\geq \frac{1}{2} \left[ I(A;B)_\rho - I(A;E)_\rho \right] [q \to q] + \frac{1}{2} I(A;E)_\rho [qq], \tag{21.23}$$

which, after resource cancelation, becomes

$$\langle \mathcal{N} \rangle \geq I(A \rangle B)_\rho [q \to q], \tag{21.24}$$

because $I(A \rangle B)_\rho = \frac{1}{2} \left[ I(A;B)_\rho - I(A;E)_\rho \right]$ (see Exercise 11.6.6). They can achieve the
coherent information of the channel simply by generating codes from the state $\varphi^{AA'}$ that
maximizes the channel's coherent information. □
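As an illustration of the rate in (21.24), the following Python/NumPy sketch evaluates the
coherent information for one specific channel and one specific input: a qubit dephasing
channel with dephasing probability `p`, fed with half of a maximally entangled pair. The
channel, the input state, and the value of `p` are assumptions chosen for the example; no
maximization over inputs is performed, so the number is only the rate achieved by this input.

```python
import numpy as np

def entropy(rho):
    """Von Neumann entropy in bits."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log2(evals)))

def reduced(psi, keep):
    """Reduced density matrix of the pure-state tensor psi on the axes in `keep`."""
    rest = [i for i in range(psi.ndim) if i not in keep]
    m = np.transpose(psi, list(keep) + rest)
    m = m.reshape(int(np.prod([psi.shape[i] for i in keep])), -1)
    return m @ m.conj().T

def dephasing_output_state(p):
    """Pure state on (A, B, E) after the isometric extension acts on A'."""
    bell = np.eye(2) / np.sqrt(2)  # amplitudes bell[a, a']
    kraus = [np.sqrt(1 - p) * np.eye(2), np.sqrt(p) * np.diag([1.0, -1.0])]
    # psi[a, b, e] = sum_{a'} bell[a, a'] * kraus[e][b, a']
    return np.stack([bell @ K.T for K in kraus], axis=-1)

p = 0.1  # hypothetical dephasing probability
psi = dephasing_output_state(p)
H = lambda keep: entropy(reduced(psi, keep))

coherent_info = H([1]) - H([0, 1])            # I(A>B) = H(B) - H(AB)
I_AB = H([0]) + H([1]) - H([0, 1])
I_AE = H([0]) + H([2]) - H([0, 2])
h2 = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
print(coherent_info, 0.5 * (I_AB - I_AE), 1 - h2)
```

All three printed values coincide: the direct definition of the coherent information, the
combination $\frac{1}{2}[I(A;B) - I(A;E)]$ used in the proof, and the closed form $1 - h_2(p)$
for this input.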
21.3 Noisy Super-Dense Coding

Recall that the resource inequality for super-dense coding is

$$[q \to q] + [qq] \geq 2 [c \to c]. \tag{21.25}$$

The entanglement-assisted classical communication protocol from the previous chapter is
one way to generalize this protocol to a noisy setting, simply by replacing the noiseless qubit
channels in (21.25) with many uses of a noisy quantum channel. This replacement leads
to the setting of entanglement-assisted classical communication presented in the previous
chapter.

Another way to generalize super-dense coding is to let the entanglement be noisy while
keeping the quantum channels noiseless. We allow Alice and Bob access to many copies of
some shared noisy state $\rho^{AB}$ and to many uses of a noiseless qubit channel with the goal
of generating noiseless classical communication. One might expect the resulting protocol to
be similar to that for entanglement-assisted classical communication, and this is indeed the
case. The resulting protocol is known as noisy super-dense coding:

Theorem 21.3.1 (Noisy Super-Dense Coding). The following resource inequality
corresponds to an achievable protocol for quantum-assisted classical communication with a noisy
quantum state:

$$\langle \rho^{AB} \rangle + H(A)_\rho [q \to q] \geq I(A;B)_\rho [c \to c], \tag{21.26}$$

where $\rho^{AB}$ is some noisy bipartite state that Alice and Bob share at the beginning of the
protocol.

Proof. The proof of the existence of a protocol proceeds similarly to the proof of
Theorem 20.4.1, with a few modifications to account for our different setting here. We simply
need to establish a way for Alice and Bob to select a code randomly, and then we can invoke
the Packing Lemma (Lemma 15.3.1) to establish the existence of a detection POVM that
Bob can employ to detect Alice's messages. The method by which they select a random
code is exactly the same as in the proof of Theorem 20.4.1, and for this reason, we
only highlight the key aspects of the proof. First consider the state $\rho^{AB}$, and suppose that
$|\varphi\rangle^{ABR}$ is a purification of this state, with $R$ a reference system to which Alice and Bob do
not have access. We can say that the state $|\varphi\rangle^{ABR}$ arises from some isometry $U_{\mathcal{N}}^{A' \to BR}$ acting
on system $A'$ of a pure state $|\varphi\rangle^{AA'}$, so that $|\varphi\rangle^{AA'}$ is defined by $|\varphi\rangle^{ABR} = U_{\mathcal{N}}^{A' \to BR} |\varphi\rangle^{AA'}$.
We can also then think that the state $\rho^{AB}$ arises from sending the state $|\varphi\rangle^{AA'}$ through a
channel $\mathcal{N}^{A' \to B}$, obtained by tracing out the environment $R$ of $U_{\mathcal{N}}^{A' \to BR}$. Our setting here
is becoming closer to the setting in the proof of Theorem 20.4.1, and we now show how it
becomes nearly identical. Observe that the state $(|\varphi\rangle^{AA'})^{\otimes n}$ admits a type decomposition,
similar to the type decomposition in (20.40-20.43):

$$\left( |\varphi\rangle^{AA'} \right)^{\otimes n} = \sum_t \sqrt{p(t)} \, |\Phi_t\rangle^{A^n A'^n}. \tag{21.27}$$

Similarly, we can write $(|\varphi\rangle^{ABR})^{\otimes n}$ as

$$\left( |\varphi\rangle^{ABR} \right)^{\otimes n} = \sum_t \sqrt{p(t)} \, |\Phi_t\rangle^{A^n | B^n R^n}, \tag{21.28}$$

where the vertical line in $A^n | B^n R^n$ indicates the bipartite cut between systems $A^n$ and
$B^n R^n$. Alice can select a unitary $U(s)^{A^n}$ of the form in (20.46) uniformly at random, and
the expected density operator with respect to this random choice of unitary is

$$\overline{\rho}^{A^n B^n} \equiv \mathbb{E}_S \left\{ U(S)^{A^n} \rho^{A^n B^n} U^\dagger(S)^{A^n} \right\} \tag{21.29}$$
$$= \sum_t p(t) \, \pi_t^{A^n} \otimes \mathcal{N}^{A'^n \to B^n} \left( \pi_t^{A'^n} \right), \tag{21.30}$$


by exploiting the development in (20.82-20.92). For each message $m$ that Alice would like
to send, she selects a vector $s$ of the form in (20.47) uniformly at random, and we can write
$s(m)$ to denote the explicit association of the vector $s$ with the message $m$ after Alice makes
the assignment. This leads to quantum-assisted codewords^3 of the following form:

$$U(s(m))^{A^n} \rho^{A^n B^n} U^\dagger(s(m))^{A^n}. \tag{21.31}$$



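We would now like to exploit the Packing Lemma (Lemma 15.3.1), and we require message
subspace projectors and a total subspace projector in order to do so. We choose them
respectively as

$$U(s)^{A^n} \Pi_{\rho,\delta}^{A^n B^n} U^\dagger(s)^{A^n}, \tag{21.32}$$
$$\Pi_{\rho,\delta}^{A^n} \otimes \Pi_{\rho,\delta}^{B^n}, \tag{21.33}$$

where $\Pi_{\rho,\delta}^{A^n B^n}$, $\Pi_{\rho,\delta}^{A^n}$, and $\Pi_{\rho,\delta}^{B^n}$ are typical projectors for $\rho^{A^n B^n}$, $\rho^{A^n}$, and $\rho^{B^n}$, respectively.
The following four conditions for the Packing Lemma hold, for the same reasons that they
hold in (20.68-20.71):

$$\mathrm{Tr}\left\{ \left( \Pi_{\rho,\delta}^{A^n} \otimes \Pi_{\rho,\delta}^{B^n} \right) \left( U(s)^{A^n} \rho^{A^n B^n} U^\dagger(s)^{A^n} \right) \right\} \geq 1 - \epsilon, \tag{21.34}$$
$$\mathrm{Tr}\left\{ \left( U(s)^{A^n} \Pi_{\rho,\delta}^{A^n B^n} U^\dagger(s)^{A^n} \right) \left( U(s)^{A^n} \rho^{A^n B^n} U^\dagger(s)^{A^n} \right) \right\} \geq 1 - \epsilon, \tag{21.35}$$
$$\mathrm{Tr}\left\{ U(s)^{A^n} \Pi_{\rho,\delta}^{A^n B^n} U^\dagger(s)^{A^n} \right\} \leq 2^{n[H(AB)_\rho + c\delta]}, \tag{21.36}$$
$$\left( \Pi_{\rho,\delta}^{A^n} \otimes \Pi_{\rho,\delta}^{B^n} \right) \overline{\rho}^{A^n B^n} \left( \Pi_{\rho,\delta}^{A^n} \otimes \Pi_{\rho,\delta}^{B^n} \right)
\leq 2^{-n[H(A)_\rho + H(B)_\rho - \eta(n,\delta) - c\delta]} \left( \Pi_{\rho,\delta}^{A^n} \otimes \Pi_{\rho,\delta}^{B^n} \right), \tag{21.37}$$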



3 We say that the codewords are "quantum-assisted" because we will allow the assistance of quantum
communication in transmitting them to Bob.

Figure 21.2: The protocol for noisy super-dense coding that corresponds to the resource inequality in
Theorem 21.3.1. Alice first projects her share into its typical subspace (not depicted). She then applies
a unitary encoding $U(s(m))$, based on her message $m$, to her share of the state $\rho^{A^n B^n}$. She compresses
her state to approximately $nH(A)$ qubits and transmits these qubits over noiseless qubit channels. Bob
decompresses the state and performs a decoding POVM that gives Alice's message $m$ with high probability.



where $c$ is some positive constant and $\eta(n,\delta)$ is a function that approaches zero as $n \to \infty$
and $\delta \to 0$. Let us assume for the moment that Alice simply sends her $A^n$ systems to
Bob with many uses of a noiseless qubit channel. It then follows from Corollary 15.5.1
(the derandomized version of the Packing Lemma) that there exists a code and a POVM
$\{\Lambda_m^{A^n B^n}\}$ that can detect the transmitted codewords of the form in (21.31) with arbitrarily
low maximal probability of error, as long as the size $|\mathcal{M}|$ of Alice's message set is small
enough:

$$p_e^* \equiv \max_m \mathrm{Tr}\left\{ \left( I - \Lambda_m^{A^n B^n} \right) U(s(m))^{A^n} \rho^{A^n B^n} U^\dagger(s(m))^{A^n} \right\} \tag{21.38}$$
$$\leq 4(\epsilon + 2\sqrt{\epsilon}) + 16 \cdot 2^{-n[H(A)_\rho + H(B)_\rho - \eta(n,\delta) - c\delta]} \, 2^{n[H(AB)_\rho + c\delta]} \, |\mathcal{M}| \tag{21.39}$$
$$= 4(\epsilon + 2\sqrt{\epsilon}) + 16 \cdot 2^{-n[I(A;B)_\rho - \eta(n,\delta) - 2c\delta]} \, |\mathcal{M}|. \tag{21.40}$$

So, we can choose the size of the message set to be $|\mathcal{M}| = 2^{n[I(A;B)_\rho - \eta(n,\delta) - 3c\delta]}$ so that the rate
$C$ of classical communication is

$$C \equiv \frac{1}{n} \log_2 |\mathcal{M}| = I(A;B)_\rho - \eta(n,\delta) - 3c\delta, \tag{21.41}$$

and the bound on the maximal probability of error becomes

$$p_e^* \leq 4(\epsilon + 2\sqrt{\epsilon}) + 16 \cdot 2^{-nc\delta}. \tag{21.42}$$



Since $\epsilon$ is an arbitrary positive number that approaches zero for sufficiently large $n$ and $\delta$
is a positive constant, the maximal probability of error vanishes as $n$ becomes large. Thus,
the quantum mutual information $I(A;B)_\rho$, with respect to the state $\rho^{AB}$, is an achievable
rate for noisy super-dense coding with $\rho$. We now summarize the protocol (with a final
modification). Alice and Bob begin with the state $\rho^{A^n B^n}$. Alice first performs a typical
subspace measurement of her system $A^n$. This measurement succeeds with high probability
and reduces the size of her system $A^n$ to a subspace with size approximately equal to $nH(A)_\rho$
qubits. If Alice wishes to send message $m$, she applies the unitary $U(s(m))^{A^n}$ to her share
of the state. She then performs a compression isometry from her subspace of $A^n$ to $nH(A)_\rho$
qubits. She transmits her qubits over $nH(A)_\rho$ noiseless qubit channels, and Bob receives
them. Bob performs the decompression isometry from the space of $nH(A)_\rho$ noiseless qubits
to a space isomorphic to Alice's original systems $A^n$. He then performs the decoding POVM
$\{\Lambda_m^{A^n B^n}\}$ and determines Alice's message $m$ with vanishingly small error probability. Note:
The only modification to the protocol is the typical subspace measurement at the beginning,
and one can readily check that this measurement does not affect any of the conditions in
(21.34-21.37). Figure 21.2 depicts the protocol. □
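To see what the rates in Theorem 21.3.1 look like for a concrete shared state, the following
Python/NumPy sketch evaluates $H(A)_\rho$ and $I(A;B)_\rho$ for a two-qubit state that mixes a
maximally entangled state with white noise. This family of states, the mixing parameter
`p`, and the function names are assumptions chosen for illustration; the theorem itself
applies to any bipartite state.

```python
import numpy as np

def entropy(rho):
    """Von Neumann entropy in bits."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log2(evals)))

def noisy_bell_state(p):
    """(1 - p) * |Phi><Phi| + p * I/4 on two qubits (an illustrative family)."""
    bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
    return (1 - p) * np.outer(bell, bell) + p * np.eye(4) / 4

def noisy_sdc_rates(rho_ab):
    """Return (H(A), I(A;B)): qubits consumed and classical bits gained per copy."""
    t = rho_ab.reshape(2, 2, 2, 2)          # indices (a, b, a', b')
    rho_a = np.einsum("abcb->ac", t)        # trace out B
    rho_b = np.einsum("abad->bd", t)        # trace out A
    h_a = entropy(rho_a)
    return h_a, h_a + entropy(rho_b) - entropy(rho_ab)

for p in (0.0, 0.2, 1.0):
    qubits, cbits = noisy_sdc_rates(noisy_bell_state(p))
    print(p, round(qubits, 4), round(cbits, 4))
```

For `p = 0` the state is a noiseless ebit and the rates reduce to one qubit consumed for two
classical bits gained, which is the super-dense coding inequality (21.25). For `p = 1` the
state is maximally mixed and the mutual information vanishes.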



21.4 State Transfer 



We can also construct a coherent version of the noisy super-dense coding protocol, in a
manner similar to the way in which the proof of Theorem 21.1.1 constructs a coherent
version of entanglement-assisted classical communication. The coherent version of
noisy super-dense coding, however, achieves an additional task: the transfer of Alice's share of the
state $(\rho^{AB})^{\otimes n}$ to Bob. The resulting protocol is known as coherent state transfer, and from
this protocol, we can derive a protocol for quantum-communication-assisted state transfer,
or quantum-assisted state transfer^4 for short.

Theorem 21.4.1 (Coherent State Transfer). The following resource inequality corresponds
to an achievable protocol for coherent state transfer with a noisy state $\rho^{AB}$:

$$\langle W^{S \to AB} : \rho^S \rangle + H(A)_\rho [q \to q] \geq I(A;B)_\rho [q \to qq] + \langle I^{S \to \hat{B}B} : \rho^S \rangle, \tag{21.43}$$

where $\rho^{AB}$ is some noisy bipartite state that Alice and Bob share at the beginning of the
protocol.

The resource inequality in (21.43) features some notation that we have not seen yet. The
expression $\langle W^{S \to AB} : \rho^S \rangle$ means that a source party $S$ distributes many copies of the state
$\rho^S$ to Alice and Bob, by applying some isometry $W^{S \to AB}$ to the state $\rho^S$. This resource is
effectively equivalent to Alice and Bob sharing many copies of the state $\rho^{AB}$, a resource we
expressed in Theorem 21.3.1 as $\langle \rho^{AB} \rangle$. The expression $\langle I^{S \to \hat{B}B} : \rho^S \rangle$ means that a source
party applies the identity map to $\rho^S$ and gives the full state to Bob. We can now state the
meaning of the resource inequality in (21.43): Using $n$ copies of the state $\rho^{AB}$ and $nH(A)_\rho$
noiseless qubit channels, Alice can simulate $nI(A;B)_\rho$ noiseless coherent channels to Bob
while at the same time transferring her share of the state $(\rho^{AB})^{\otimes n}$ to him.



Proof. The proof proceeds similarly to the proof of Theorem 21.1.1. Let $|\varphi\rangle^{ABR}$ be a
purification of $\rho^{AB}$. Alice begins with a state that she shares with a reference system $R_1$, on
which she would like to simulate coherent channels:

$$|\psi\rangle^{R_1 A_1} \equiv \sum_{l,m=1}^{D^2} \alpha_{l,m} |l\rangle^{R_1} |m\rangle^{A_1}, \tag{21.44}$$



4 This protocol goes by several other names in the quantum Shannon theory literature: state transfer,
fully-quantum Slepian-Wolf, state merging, and the merging mother.

where $D^2 \approx 2^{nI(A;B)_\rho}$. She appends $|\psi\rangle^{R_1 A_1}$ to $|\varphi\rangle^{A^n B^n R^n} \equiv (|\varphi\rangle^{ABR})^{\otimes n}$ and applies a typical
subspace measurement to her system $A^n$. (In what follows, we use the same notation for the
typical projected state because the states are the same up to a vanishingly small error.) She
applies the following controlled unitary to her systems $A_1 A^n$:

$$\sum_m |m\rangle\langle m|^{A_1} \otimes U(s(m))^{A^n}, \tag{21.45}$$

resulting in the overall state:

$$\sum_{l,m} \alpha_{l,m} |l\rangle^{R_1} |m\rangle^{A_1} U(s(m))^{A^n} |\varphi\rangle^{A^n B^n R^n}. \tag{21.46}$$

Alice compresses her $A^n$ systems, sends them over $nH(A)_\rho$ noiseless qubit channels, and Bob
receives them. He decompresses them and places them in systems $\hat{B}^n$ isomorphic to $A^n$. The
resulting state is the same as $|\varphi\rangle^{A^n B^n R^n}$, with the systems $A^n$ replaced by $\hat{B}^n$. Bob performs
a coherent gentle measurement of the following form:

$$\sum_m \sqrt{\Lambda_m}^{\hat{B}^n B^n} \otimes |m\rangle^{B_1}, \tag{21.47}$$

resulting in a state that is close in trace distance to

$$\sum_{l,m} \alpha_{l,m} |l\rangle^{R_1} |m\rangle^{A_1} |m\rangle^{B_1} U(s(m))^{\hat{B}^n} |\varphi\rangle^{\hat{B}^n B^n R^n}. \tag{21.48}$$

He finally performs the controlled unitary

$$\sum_m |m\rangle\langle m|^{B_1} \otimes U^\dagger(s(m))^{\hat{B}^n}, \tag{21.49}$$

resulting in the state

$$\left( \sum_{l,m} \alpha_{l,m} |l\rangle^{R_1} |m\rangle^{A_1} |m\rangle^{B_1} \right) \otimes |\varphi\rangle^{\hat{B}^n B^n R^n}. \tag{21.50}$$

Thus, Alice has simulated $nI(A;B)_\rho$ coherent channels to Bob with arbitrarily small error,
while also transferring her share of the state $|\varphi\rangle^{A^n B^n R^n}$ to him. Figure 21.3 depicts the
protocol. □

We obtain the following resource inequality for quantum-assisted state transfer, by
combining the above protocol with the coherent communication identity:

Corollary 21.4.1 (Quantum-Assisted State Transfer). The following resource inequality
corresponds to an achievable protocol for quantum-assisted state transfer with a noisy state
$\rho^{AB}$:

$$\langle W^{S \to AB} : \rho^S \rangle + \frac{1}{2} I(A;R)_\varphi [q \to q] \geq \frac{1}{2} I(A;B)_\varphi [qq] + \langle I^{S \to \hat{B}B} : \rho^S \rangle, \tag{21.51}$$

where $\rho^{AB}$ is some noisy bipartite state that Alice and Bob share at the beginning of the
protocol, and $|\varphi\rangle^{ABR}$ is a purification of it.

Figure 21.3: The protocol for coherent state transfer, a coherent version of the noisy super-dense coding
protocol that accomplishes the task of state transfer in addition to coherent communication.



Proof. Consider the following chain of resource inequalities:

$$\langle W^{S \to AB} : \rho^S \rangle + H(A)_\varphi [q \to q]
\geq I(A;B)_\varphi [q \to qq] + \langle I^{S \to \hat{B}B} : \rho^S \rangle \tag{21.52}$$
$$\geq \frac{1}{2} I(A;B)_\varphi [q \to q] + \frac{1}{2} I(A;B)_\varphi [qq] + \langle I^{S \to \hat{B}B} : \rho^S \rangle, \tag{21.53}$$

where the first follows from coherent state transfer and the second follows from the coherent
communication identity. By resource cancelation, we obtain the resource inequality in the
statement of the theorem because $\frac{1}{2} I(A;R)_\varphi = H(A)_\varphi - \frac{1}{2} I(A;B)_\varphi$. □

Corollary 21.4.2 (Classical-Assisted State Transfer). The following resource inequality
corresponds to an achievable protocol for classical-assisted state transfer with a noisy state $\rho^{AB}$:

$$\langle W^{S \to AB} : \rho^S \rangle + I(A;R)_\varphi [c \to c] \geq I(A \rangle B)_\varphi [qq] + \langle I^{S \to \hat{B}B} : \rho^S \rangle, \tag{21.54}$$

where $\rho^{AB}$ is some noisy bipartite state that Alice and Bob share at the beginning of the
protocol, and $|\varphi\rangle^{ABR}$ is a purification of it.

Proof. We simply combine the protocol above with teleportation:

$$\langle W^{S \to AB} : \rho^S \rangle + \frac{1}{2} I(A;R)_\varphi [q \to q] + I(A;R)_\varphi [c \to c] + \frac{1}{2} I(A;R)_\varphi [qq]
\geq \frac{1}{2} I(A;B)_\varphi [qq] + \langle I^{S \to \hat{B}B} : \rho^S \rangle + \frac{1}{2} I(A;R)_\varphi [q \to q]. \tag{21.55}$$

Canceling terms for both quantum communication and entanglement, we obtain the resource
inequality in the statement of the corollary. □



©2012 Mark M. Wilde — This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 
Unported License 






The above protocol gives a wonderful operational interpretation to the coherent
information (or negative conditional entropy $-H(A|B)$). When the coherent information is
positive, Alice and Bob share that rate of entanglement at the end of the protocol (and
thus the ability to teleport if extra classical communication is available). When the coherent
information is negative, they need to consume entanglement at a rate of $H(A|B)$ ebits per
copy in order for the state transfer process to complete.
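The sign of the coherent information is easy to explore numerically. The following
Python/NumPy sketch computes $I(A \rangle B) = H(B) - H(AB) = -H(A|B)$ for three two-qubit
states: a maximally entangled state, a classically correlated state, and the maximally mixed
state. These example states and the function names are illustrative assumptions, not taken
from the text.

```python
import numpy as np

def entropy(rho):
    """Von Neumann entropy in bits."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-np.sum(evals * np.log2(evals)))

def coherent_information(rho_ab):
    """I(A>B) = H(B) - H(AB) for a two-qubit density matrix."""
    t = rho_ab.reshape(2, 2, 2, 2)      # indices (a, b, a', b')
    rho_b = np.einsum("abad->bd", t)    # trace out A
    return entropy(rho_b) - entropy(rho_ab)

bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
entangled = np.outer(bell, bell)               # pure maximally entangled state
correlated = np.diag([0.5, 0.0, 0.0, 0.5])     # classical mixture of |00> and |11>
mixed = np.eye(4) / 4                          # maximally mixed state

rates = {
    "entangled": coherent_information(entangled),
    "correlated": coherent_information(correlated),
    "mixed": coherent_information(mixed),
}
print(rates)
```

Under the interpretation above, the entangled state yields one ebit per copy, the classically
correlated state yields none, and the maximally mixed state requires consuming one ebit per
copy.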

Exercise 21.4.1 Suppose that Alice actually possesses the reference $R$ in the above
protocols. Show that Alice and Bob can achieve the following resource inequality:

$$\langle \psi^{ABR} \rangle + \frac{1}{2} I(A;R)_\psi [q \to q] \geq \frac{1}{2} \left( H(A)_\psi + H(B)_\psi + H(R)_\psi \right) [qq], \tag{21.56}$$

where $|\psi\rangle^{ABR}$ is some pure state.

21.4.1 The Dual Roles of Quantum Mutual Information 



The resource inequality for entanglement-assisted quantum communication in (21.15) and
that for quantum-assisted state transfer in (21.51) appear to be strikingly similar. Both
contain a noisy resource and both consume a noiseless quantum resource in order to generate
another noiseless quantum resource. We say that these two protocols are related by
source-channel duality because we obtain one protocol from another by changing channels to states
and vice versa.

Also, both protocols require the consumed rate of the noiseless quantum resource to be
equal to half the quantum mutual information between the system $A$ for which we are trying
to preserve quantum coherence and the environment to which we do not have access. In both
cases, our goal is to break the correlations between the system $A$ and the environment, and
the quantum mutual information is quantifying how much quantum coherence is required
to break these correlations. Both protocols in (21.15) and (21.51) have their rates for the
generated noiseless quantum resource equal to half the quantum mutual information between
the system $A$ and the system $B$. Thus, the quantum mutual information is also quantifying
how much quantum correlation we can establish between two systems: it plays the dual
role of quantifying both the destruction and creation of correlations.

21.5 Trade-off Coding 

Suppose that you are a communication engineer working at a quantum communication com- 
pany named EA-USA. Suppose further that your company has made quite a profit from 
entanglement-assisted classical communication, beating out the communication rates that 
other companies can achieve simply because your company has been able to generate high- 
quality noiseless entanglement between several nodes in its network, while the competitors 
have not been able to do so. But now suppose that your customer base has become so large 
that there is not enough entanglement to support protocols that achieve the rates given in
the entanglement-assisted classical capacity theorem (Theorem 20.3.1). Your boss would like
you to make the best of this situation, by determining the optimal rates of classical com- 
munication for a fixed entanglement budget. He is hoping that you will be able to design a 
protocol such that there will only be a slight decrease in communication rates. You tell him 
that you will do your best. 

What should you do in this situation? Your first thought might be that we have
already determined unassisted classical codes with a communication rate equal to the channel
Holevo information $\chi(\mathcal{N})$ and we have also determined entanglement-assisted codes with a
communication rate equal to the channel mutual information $I(\mathcal{N})$. It might seem that a
reasonable strategy is to mix these two strategies, using some fraction $\lambda$ of the channel uses
for the unassisted classical code and the other fraction $1 - \lambda$ of the channel uses for the
entanglement-assisted code. This strategy achieves a rate of

$$\lambda \chi(\mathcal{N}) + (1 - \lambda) I(\mathcal{N}), \tag{21.57}$$

and it has an error no larger than the sum of the errors of the individual codes (thus,
this error vanishes asymptotically). Meanwhile, it consumes entanglement at a lower rate of
$(1 - \lambda)E$ ebits per channel use, if $E$ is the amount of entanglement that the original protocol
for entanglement-assisted classical communication consumes. This simple mixing strategy
is known as time-sharing. You figure this strategy might perform well, and you suggest it
to your boss. After your boss reviews your proposal, he sends it back to you, telling you
that he already thought of this solution and suggests that you are going to have to be a bit
more clever; otherwise, he suspects that the existing customer base will notice the drop in
communication rates.
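The time-sharing trade-off in (21.57) is a simple linear interpolation, as the following
Python sketch illustrates. The function name `time_sharing` and the numerical values of the
Holevo information, the mutual information, and the entanglement cost are hypothetical
placeholders, not values for any particular channel.

```python
def time_sharing(lam, chi, mutual_info, ebits):
    """Rates for time-sharing between unassisted and entanglement-assisted codes.

    lam         -- fraction of channel uses given to the unassisted code
    chi         -- Holevo information of the channel (bits per use)
    mutual_info -- mutual information of the channel (bits per use)
    ebits       -- entanglement consumed per use by the assisted code
    Returns (classical rate, entanglement consumption rate).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    classical_rate = lam * chi + (1.0 - lam) * mutual_info
    entanglement_rate = (1.0 - lam) * ebits
    return classical_rate, entanglement_rate

# Hypothetical channel parameters, chosen only to show the interpolation.
rate, ent = time_sharing(lam=0.25, chi=1.0, mutual_info=1.5, ebits=1.0)
print(rate, ent)
```

Sweeping `lam` from 0 to 1 traces the straight line between the fully assisted and the
unassisted operating points; any strategy that lies above this line for some entanglement
budget improves on time-sharing.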

Another strategy for communication is known as trade-off coding. We explore this
strategy in the forthcoming section and in a broader context in Chapter 24. Trade-off coding
beats time-sharing for many channels of interest, but for other channels, it just reduces to
time-sharing. It is not clear a priori how to determine which channels benefit from
trade-off coding, but it certainly depends on the channel for which Alice and Bob are coding.
Chapter 24 follows up on the development here by demonstrating that this trade-off coding
strategy is provably optimal for certain channels, and for general channels, it is optimal in
the sense of regularized formulas. Trade-off coding is our best known way to deal with the
above situation with a fixed entanglement budget, and your boss should be pleased with
these results. Furthermore, we can upgrade the protocol outlined below to one that achieves
entanglement-assisted communication of both classical and quantum information.

21.5.1 Trading between Unassisted and Assisted Classical Communication

We first show that the resource inequality given in the following theorem is achievable, and we
follow up with an interpretation of it in the context of trade-off coding. We name the protocol
CE trade-off coding because it captures the trade-off between classical communication and
entanglement consumption.

Theorem 21.5.1 (CE Trade-off Coding). The following resource inequality corresponds
to an achievable protocol for entanglement-assisted classical communication over a noisy
quantum channel:

$$\langle \mathcal{N} \rangle + H(A|X)_\rho [qq] \geq I(AX;B)_\rho [c \to c], \tag{21.58}$$

where $\rho^{XAB}$ is a state of the following form:

$$\rho^{XAB} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \mathcal{N}^{A' \to B}(\varphi_x^{AA'}), \tag{21.59}$$

and the states $\varphi_x^{AA'}$ are pure.



Proof. The proof of the above trade-off coding theorem exploits the direct parts of both
the HSW coding theorem (Theorem 19.3.1) and the entanglement-assisted classical capacity
theorem (Theorem 20.4.1). In particular, we exploit the fact that the HSW codewords in
Theorem 19.3.1 arise from strongly typical sequences and that the entanglement-assisted
quantum codewords from Theorem 20.4.1 are tensor power states after tracing over Bob's
shares of the entanglement (this is the observation in Remark 20.4.1). Suppose that Alice
and Bob exploit an HSW code for the channel $\mathcal{N}^{A' \to B}$. Such a code consists of a codebook
$\{\rho_{x^n(m)}\}_m$ with $\approx 2^{nI(X;B)_\rho}$ quantum codewords. The Holevo information $I(X;B)_\rho$ is with
respect to some classical-quantum state $\rho^{XB}$ where

$$\rho^{XB} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \mathcal{N}^{A' \to B}(\rho_x^{A'}), \tag{21.60}$$

and each codeword $\rho_{x^n(m)}$ is a tensor-product state of the form

$$\rho_{x^n(m)} = \rho_{x_1(m)} \otimes \rho_{x_2(m)} \otimes \cdots \otimes \rho_{x_n(m)}. \tag{21.61}$$

Corresponding to the codebook is some decoding POVM $\{\Lambda_m^{B^n}\}$, which Bob can employ to
decode each codeword transmitted through the channel with arbitrarily high probability for
all $\epsilon > 0$:

$$\forall m \quad \mathrm{Tr}\left\{ \Lambda_m^{B^n} \mathcal{N}^{A'^n \to B^n}\left( \rho_{x^n(m)}^{A'^n} \right) \right\} \geq 1 - \epsilon. \tag{21.62}$$



Recall from the direct part of Theorem 19.3.1 that we select each codeword from the set of
strongly typical sequences for the distribution $p_X(x)$ (see Definition 13.7.2). This implies
that each classical codeword $x^n(m)$ has approximately $np_X(a_1)$ occurrences of the symbol
$a_1 \in \mathcal{X}$, $np_X(a_2)$ occurrences of the symbol $a_2 \in \mathcal{X}$, and so on, for all letters in the alphabet
$\mathcal{X}$. Without loss of generality and for simplicity, we assume that each codeword $x^n(m)$ has
exactly these numbers of occurrences of the symbols in the alphabet $\mathcal{X}$. Then for any strongly
typical sequence $x^n$, there exists some permutation $\pi$ that arranges it in lexicographical order
according to the alphabet $\mathcal{X}$. That is, this permutation arranges the sequence $x^n$ into $|\mathcal{X}|$
blocks, each of length $np_X(a_1), \ldots, np_X(a_{|\mathcal{X}|})$:

$$\pi(x^n) = \underbrace{a_1 \cdots a_1}_{np_X(a_1)} \, \underbrace{a_2 \cdots a_2}_{np_X(a_2)} \cdots \underbrace{a_{|\mathcal{X}|} \cdots a_{|\mathcal{X}|}}_{np_X(a_{|\mathcal{X}|})}. \tag{21.63}$$

The same holds true for the corresponding permutation operator $\pi$ applied to a quantum
state $\rho_{x^n}$ generated from a strongly typical sequence $x^n$:

$$\pi(\rho_{x^n}) = \underbrace{\rho_{a_1} \otimes \cdots \otimes \rho_{a_1}}_{np_X(a_1)} \otimes \underbrace{\rho_{a_2} \otimes \cdots \otimes \rho_{a_2}}_{np_X(a_2)} \otimes \cdots \otimes \underbrace{\rho_{a_{|\mathcal{X}|}} \otimes \cdots \otimes \rho_{a_{|\mathcal{X}|}}}_{np_X(a_{|\mathcal{X}|})}. \tag{21.64}$$
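The rearrangement in (21.63) is an ordinary stable sort of the codeword by alphabet order.
The following Python sketch computes such a permutation, the arranged sequence, and the block
lengths. The function name `lexicographic_blocks`, the alphabet, and the sample sequence are
hypothetical and serve only to illustrate the bookkeeping.

```python
from collections import Counter

def lexicographic_blocks(seq, alphabet):
    """Sort `seq` into blocks ordered by `alphabet`.

    Returns (perm, arranged, lengths), where perm[k] is the position in `seq`
    of the symbol placed at position k, arranged is the permuted sequence, and
    lengths[i] is the length of the block for alphabet[i].
    """
    rank = {symbol: i for i, symbol in enumerate(alphabet)}
    # Stable ordering: by alphabet rank first, then by original position.
    perm = sorted(range(len(seq)), key=lambda k: (rank[seq[k]], k))
    arranged = [seq[k] for k in perm]
    counts = Counter(seq)
    lengths = [counts[symbol] for symbol in alphabet]
    return perm, arranged, lengths

alphabet = ["a", "b", "c"]
codeword = ["b", "a", "c", "a", "b", "a"]
perm, arranged, lengths = lexicographic_blocks(codeword, alphabet)
print(perm, arranged, lengths)
```

Applying the same index permutation to the tensor factors of a codeword state gives the block
structure in (21.64), with one block of identical states per alphabet symbol.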



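Now, we assume that $n$ is quite large, so large that each of $np_X(a_1), \ldots, np_X(a_{|\mathcal{X}|})$ is
large enough for the law of large numbers to come into play for each block in the permuted
sequence $\pi(x^n)$ and tensor-product state $\pi(\rho_{x^n})$. Let $\phi_x^{AA'}$ be a purification of each $\rho_x^{A'}$ in
the ensemble $\{p_X(x), \rho_x^{A'}\}$, where we assume that Alice has access to system $A'$ and Bob
has access to $A$. Then, for every HSW quantum codeword $\rho_{x^n(m)}^{A'^n}$, there is some purification
$\phi_{x^n(m)}^{A^n A'^n}$, where

\[ \phi_{x^n(m)}^{A^n A'^n} \equiv \phi_{x_1(m)}^{A_1 A'_1} \otimes \phi_{x_2(m)}^{A_2 A'_2} \otimes \cdots \otimes \phi_{x_n(m)}^{A_n A'_n}, \tag{21.65} \]

Alice has access to the systems $A'^n \equiv A'_1 \cdots A'_n$, and Bob has access to $A^n \equiv A_1 \cdots A_n$.
Applying the permutation $\pi$ to any purified tensor-product state $\phi_{x^n}$ gives

\[ \pi(\phi_{x^n}) = \underbrace{\phi_{a_1} \otimes \cdots \otimes \phi_{a_1}}_{np_X(a_1)} \otimes \underbrace{\phi_{a_2} \otimes \cdots \otimes \phi_{a_2}}_{np_X(a_2)} \otimes \cdots \otimes \underbrace{\phi_{a_{|\mathcal{X}|}} \otimes \cdots \otimes \phi_{a_{|\mathcal{X}|}}}_{np_X(a_{|\mathcal{X}|})}, \tag{21.66} \]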



where we have assumed that the permutation applies on both the purification systems $A^n$ and
the systems $A'^n$. We can now formulate a strategy for trade-off coding. Alice begins with
a standard classical sequence $x^n$ that is in lexicographical order, having exactly $np_X(a_i)$
occurrences of the symbol $a_i \in \mathcal{X}$ (of the form in (21.63)). According to this sequence,
she arranges the states $\{\phi_{a_i}^{AA'}\}$ to be in $|\mathcal{X}|$ blocks, each of length $np_X(a_i)$; the resulting
state is of the same form as in (21.66). Since $np_X(a_i)$ is large enough for the law of large
numbers to come into play, for each block, there exists an entanglement-assisted classical code
with $\approx 2^{np_X(a_i) I(A;B)_{\mathcal{N}(\phi_{a_i})}}$ entanglement-assisted quantum codewords, where the quantum mutual
information $I(A;B)_{\mathcal{N}(\phi_{a_i})}$ is with respect to the state $\mathcal{N}^{A' \to B}(\phi_{a_i}^{AA'})$. Let $n_i \equiv np_X(a_i)$.
Then each of these $|\mathcal{X}|$ entanglement-assisted classical codes consumes $n_i H(A)_{\phi_{a_i}}$ ebits. The
entanglement-assisted quantum codewords for each block are of the form:

\[ U(s(l_i))^{A'^{n_i}} \big( \phi_{a_i}^{A^{n_i} A'^{n_i}} \big) U^\dagger(s(l_i))^{A'^{n_i}}, \tag{21.67} \]

where $l_i$ is a message in the message set of size $\approx 2^{n_i I(A;B)_{\mathcal{N}(\phi_{a_i})}}$, the state
$\phi_{a_i}^{A^{n_i} A'^{n_i}} = \phi_{a_i}^{A_1 A'_1} \otimes \cdots \otimes \phi_{a_i}^{A_{n_i} A'_{n_i}}$,
and the unitaries $U(s(l_i))^{A'^{n_i}}$ are of the form in (20.46). Observe that the
codewords in (21.67) are all equal to $\rho_{a_i}^{A'^{n_i}}$ after tracing over Bob's systems $A^{n_i}$, regardless of
the particular unitary that Alice applies (this is the content of Remark 20.4.1). Alice then
determines the permutation $\pi_m$ needed to permute the standard sequence $x^n$ to a codeword
sequence $x^n(m)$, and she applies the permutation operator $\pi_m$ to her systems $A'^n$ so that
her channel input density operator is the HSW quantum codeword $\rho_{x^n(m)}^{A'^n}$ (we are tracing
over Bob's systems $A^n$ and applying Remark 20.4.1 to obtain this result). She transmits
her systems $A'^n$ over the channel to Bob. If Bob ignores his share of the entanglement in
$A^n$, the state that he receives from the channel is $\mathcal{N}^{A'^n \to B^n}(\rho_{x^n(m)}^{A'^n})$. He then applies his
HSW measurement $\{\Lambda_m^{B^n}\}$ to the systems $B^n$ received from the channel, and he determines
the sequence $x^n(m)$, and hence the message $m$, with nearly unit probability. Also, this
measurement has negligible disturbance on the state, so that the post-measurement state is
$2\sqrt{\epsilon}$-close in trace distance to the state that Alice transmitted through the channel (in what
follows, we assume that the measurement does not change the state, and we collect error
terms at the end of the proof). Now that he knows $m$, he applies the inverse permutation
operator $\pi_m^{-1}$ to his systems $B^n$, and we are assuming that he already has his share $A^n$ of
the entanglement arranged in lexicographical order according to the standard sequence $x^n$.
His state is then as follows:

\[ \bigotimes_{i=1}^{|\mathcal{X}|} \mathcal{N}^{A'^{n_i} \to B^{n_i}} \Big( U(s(l_i))^{A'^{n_i}} \big( \phi_{a_i}^{A^{n_i} A'^{n_i}} \big) U^\dagger(s(l_i))^{A'^{n_i}} \Big). \tag{21.68} \]



At this point, he can decode the message $l_i$ in the $i$th block by performing a collective
measurement on the systems $A^{n_i} B^{n_i}$. He does this for each of the $|\mathcal{X}|$ entanglement-assisted
classical codes, and this completes the protocol for trade-off coding. The total error
accumulated in this protocol is no larger than the sum of $\epsilon$ for the first measurement, $2\sqrt{\epsilon}$ for
the disturbance of the state, and $|\mathcal{X}|\epsilon$ for the error from the final measurement of the $|\mathcal{X}|$
blocks. The proof here assumes that every classical codeword $x^n(m)$ has exactly $np_X(a_i)$
occurrences of symbol $a_i \in \mathcal{X}$, but it is straightforward to modify the above protocol to
allow for imprecision, i.e., if the codewords are $\delta$-strongly typical. Figure 21.4 depicts this
protocol for an example. We now show how the total rate of classical communication adds
up to $I(AX;B)_\rho$, where $\rho^{XAB}$ is a state of the form in (21.59). First, we can apply the chain
rule for quantum mutual information to observe that the total rate $I(AX;B)_\rho$ is the sum of
a Holevo information $I(X;B)_\rho$ and a classically conditioned quantum mutual information
$I(A;B|X)_\rho$:

\[ I(AX;B)_\rho = I(X;B)_\rho + I(A;B|X)_\rho. \tag{21.69} \]

They achieve the rate $I(X;B)_\rho$ because Bob first reliably decodes the HSW quantum codeword,
of which there can be $\approx 2^{nI(X;B)_\rho}$. His next step is to permute and decode the $|\mathcal{X}|$
blocks, each consisting of an entanglement-assisted classical code on $\approx np_X(x)$ channel uses.
Each entanglement-assisted classical code can communicate $np_X(x) I(A;B)_{\rho_x}$ bits while
consuming $np_X(x) H(A)_{\rho_x}$ ebits. Thus, the total rate of classical communication for this last part
is

\[ \frac{\text{\# of bits generated}}{\text{\# of channel uses}} \approx \frac{\sum_x n\, p_X(x) I(A;B)_{\rho_x}}{\sum_x n\, p_X(x)} \tag{21.70} \]
\[ = \sum_x p_X(x) I(A;B)_{\rho_x} \tag{21.71} \]
\[ = I(A;B|X)_\rho. \tag{21.72} \]
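The block-by-block accounting above is a weighted average, and a few lines of Python make the bookkeeping explicit. This is only a sketch: the function names and all numerical values below are hypothetical placeholders chosen for illustration, not quantities computed from any particular channel.

```python
def block_rate(n, p_x, per_letter_values):
    """Resource summed over all blocks, divided by the number of channel uses.

    Block x uses n * p_x[x] channel uses and contributes
    n * p_x[x] * per_letter_values[x] units of the resource.
    """
    total = sum(n * p * v for p, v in zip(p_x, per_letter_values))
    channel_uses = sum(n * p for p in p_x)
    return total / channel_uses


def tradeoff_rates(n, p_x, holevo_info, mutual_infos, entropies):
    """Return (classical rate, entanglement consumption rate).

    holevo_info stands for I(X;B), mutual_infos[x] for I(A;B) of letter x,
    and entropies[x] for H(A) of letter x.
    """
    assisted_rate = block_rate(n, p_x, mutual_infos)   # plays the role of I(A;B|X)
    ebit_rate = block_rate(n, p_x, entropies)          # plays the role of H(A|X)
    return holevo_info + assisted_rate, ebit_rate


# Hypothetical three-letter example (values are made up for illustration).
classical_rate, ebit_rate = tradeoff_rates(
    n=1000,
    p_x=[0.5, 0.25, 0.25],
    holevo_info=0.5,
    mutual_infos=[2.0, 1.0, 1.0],
    entropies=[1.0, 0.5, 0.5],
)
```

Note that the block length $n$ cancels between numerator and denominator, so the rates depend only on the distribution $p_X$ and the per-letter quantities.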
Figure 21.4: A simple protocol for trade-off coding between assisted and unassisted classical communication.
Blue systems belong to Bob, and red systems belong to Alice. Alice wishes to send the classical message
$m$ while also sending the messages $l_1$, $l_2$, and $l_3$. Her HSW codebook has the message $m$ map to the sequence
1231213, which in turn gives the HSW quantum codeword $\rho_1 \otimes \rho_2 \otimes \rho_3 \otimes \rho_1 \otimes \rho_2 \otimes \rho_1 \otimes \rho_3$. A purification
of these states is the following tensor product of pure states: $\phi_1 \otimes \phi_2 \otimes \phi_3 \otimes \phi_1 \otimes \phi_2 \otimes \phi_1 \otimes \phi_3$, where
Bob possesses the purification of each state in the tensor product. She begins with these states arranged in
lexicographic order in three blocks (there are three letters in this alphabet). For each block $i$, she encodes
the message $l_i$ with the local unitaries for an entanglement-assisted classical code. She then permutes her
shares of the entangled states according to the permutation associated with the message $m$. She inputs
her systems to many uses of the channel, and Bob receives the outputs. His first action is to ignore his
shares of the entanglement and perform a collective HSW measurement on all of the channel outputs. With
high probability, he can determine the message $m$ while causing a negligible disturbance to the state of the
channel outputs. Based on the message $m$, he performs the inverse of the permutation that Alice used at
the encoder. He combines his shares of the entanglement with the permuted channel outputs. His final three
measurements are those given by the three entanglement-assisted codes Alice used at the encoder, and they
detect the messages $l_1$, $l_2$, and $l_3$ with high probability.
and similarly, the total rate of entanglement consumption is

\[ \frac{\text{\# of ebits consumed}}{\text{\# of channel uses}} \approx \frac{\sum_x n\, p_X(x) H(A)_{\rho_x}}{\sum_x n\, p_X(x)} \tag{21.73} \]
\[ = \sum_x p_X(x) H(A)_{\rho_x} \tag{21.74} \]
\[ = H(A|X)_\rho. \tag{21.75} \]

This gives the resource inequality in the statement of the theorem. □



21.5.2 Trade-off Coding Subsumes Time-Sharing 

Before proceeding to other trade-off coding settings, we show how time-sharing emerges as a
special case of a trade-off coding strategy. Recall from (21.57) that time-sharing can achieve
the rate $\lambda \chi(\mathcal{N}) + (1 - \lambda) I(\mathcal{N})$ for any $\lambda$ such that $0 \leq \lambda \leq 1$. Suppose that $\phi^{AA'}$ is the pure
state that maximizes the channel mutual information $I(\mathcal{N})$, and suppose that $\{p_X(x), \psi_x^{A'}\}$
is an ensemble of pure states that maximizes the channel Holevo information $\chi(\mathcal{N})$ (recall
from Theorem 12.3.2 that it is sufficient to consider pure states for maximizing the Holevo
information of a channel). Time-sharing simply mixes between these two strategies, and we
can construct a classical-quantum state of the form in (21.59), for which time-sharing turns
out to be the strategy executed by the constructed trade-off code:

\[ \sigma^{UXAB} \equiv (1 - \lambda) |0\rangle\langle 0|^U \otimes |0\rangle\langle 0|^X \otimes \mathcal{N}^{A' \to B}(\phi^{AA'}) + \lambda |1\rangle\langle 1|^U \otimes \sum_x p_X(x) |x\rangle\langle x|^X \otimes |0\rangle\langle 0|^A \otimes \mathcal{N}^{A' \to B}(\psi_x^{A'}). \tag{21.76} \]

X 

In the above, the register $U$ is acting as a classical binary flag to indicate whether the code
should be an entanglement-assisted classical capacity achieving code or a code that achieves
the channel's Holevo information. The number of classical bits that Alice can communicate
to Bob with a trade-off code is $I(AUX;B)_\sigma$, where we have assumed that $U$ and $X$ together
form the classical register. We can then evaluate this mutual information by applying the
chain rule:

\[ I(AUX;B)_\sigma = I(A;B|XU)_\sigma + I(X;B|U)_\sigma + I(U;B)_\sigma \tag{21.77} \]
\[ = (1 - \lambda) I(A;B)_{\mathcal{N}(\phi)} + \lambda \Big[ \sum_x p_X(x) I(A;B)_{|0\rangle\langle 0| \otimes \mathcal{N}(\psi_x)} \Big] + (1 - \lambda) I(X;B)_{|0\rangle\langle 0| \otimes \mathcal{N}(\phi)} + \lambda I(X;B)_{\{p_X(x), \psi_x\}} + I(U;B)_\sigma \tag{21.78} \]
\[ \geq (1 - \lambda) I(\mathcal{N}) + \lambda \chi(\mathcal{N}). \tag{21.79} \]

The second equality follows by evaluating the first two conditional mutual informations. The
inequality follows from the assumptions that $I(\mathcal{N}) = I(A;B)_{\mathcal{N}(\phi)}$ and $\chi(\mathcal{N}) = I(X;B)_{\{p_X(x), \psi_x\}}$,
the fact that quantum mutual information vanishes on product states, and $I(U;B)_\sigma \geq 0$.
Thus, in certain cases, this strategy might do slightly better than time-sharing, but for
channels for which $\phi^{A'} = \sum_x p_X(x) \psi_x^{A'}$, this strategy is equivalent to time-sharing because
$I(U;B)_\sigma = 0$ in this latter case.

Thus, time-sharing emerges as a special case of trade-off coding. In general, we can try
to see if trade-off coding beats time-sharing for certain channels by optimizing the rates in
Theorem 21.5.1 over all possible choices of states of the form in (21.59).



21.5.3 Trading between Coherent and Classical Communication 



We obtain the following corollary of Theorem 21.5.1 simply by upgrading the $|\mathcal{X}|$ entanglement-assisted
classical codes to entanglement-assisted coherent codes. The upgrading is along the
same lines as that in the proof of Theorem 21.1.1, and for this reason, we omit the proof.

Corollary 21.5.1. The following resource inequality corresponds to an achievable protocol
for entanglement-assisted coherent communication over a noisy quantum channel $\mathcal{N}$:

\[ \langle \mathcal{N} \rangle + H(A|X)_\rho\,[qq] \geq I(A;B|X)_\rho\,[q \to qq] + I(X;B)_\rho\,[c \to c], \tag{21.80} \]

where $\rho^{XAB}$ is a state of the following form:

\[ \rho^{XAB} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \mathcal{N}^{A' \to B}(\phi_x^{AA'}), \tag{21.81} \]

and the states $\phi_x^{AA'}$ are pure.

21.5.4 Trading between Classical Communication and Entanglement-Assisted Quantum Communication

We end this section with a protocol that achieves entanglement-assisted communication of
both classical and quantum information. It is essential to the trade-off between a noisy
quantum channel and the three resources of noiseless classical communication, noiseless
quantum communication, and noiseless entanglement. We study this trade-off in full detail
in Chapter 24, where we show that combining this protocol with teleportation, super-dense
coding, and entanglement distribution is sufficient to achieve any task in dynamic quantum
Shannon theory involving the three unit resources.

Corollary 21.5.2 (CQE Trade-off Coding). The following resource inequality corresponds
to an achievable protocol for entanglement-assisted communication of classical and quantum
information over a noisy quantum channel $\mathcal{N}$:

\[ \langle \mathcal{N} \rangle + \tfrac{1}{2} I(A;E|X)_\rho\,[qq] \geq \tfrac{1}{2} I(A;B|X)_\rho\,[q \to q] + I(X;B)_\rho\,[c \to c], \tag{21.82} \]

where $\rho^{XABE}$ is a state of the following form:

\[ \rho^{XABE} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes U_{\mathcal{N}}^{A' \to BE}(\phi_x^{AA'}), \tag{21.83} \]

the states $\phi_x^{AA'}$ are pure, and $U_{\mathcal{N}}^{A' \to BE}$ is an isometric extension of the channel $\mathcal{N}^{A' \to B}$.
Proof. Consider the following chain of resource inequalities:

\[ \langle \mathcal{N} \rangle + H(A|X)_\rho\,[qq] \]
\[ \geq I(A;B|X)_\rho\,[q \to qq] + I(X;B)_\rho\,[c \to c] \tag{21.84} \]
\[ \geq \tfrac{1}{2} I(A;B|X)_\rho\,[qq] + \tfrac{1}{2} I(A;B|X)_\rho\,[q \to q] + I(X;B)_\rho\,[c \to c]. \tag{21.85} \]

The first inequality is the statement in Corollary 21.5.1, and the second inequality follows
from the coherent communication identity. After resource cancelation and noting that
$H(A|X)_\rho - \tfrac{1}{2} I(A;B|X)_\rho = \tfrac{1}{2} I(A;E|X)_\rho$, the resulting resource inequality is equivalent to
the one in (21.82). □
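For completeness, the entropy identity used in the cancelation step can be checked directly. The short derivation below is our own spelling-out of that step; it uses only the fact that, for each $x$, the state $U_{\mathcal{N}}^{A' \to BE}(\phi_x^{AA'})$ on $ABE$ is pure, so that $H(AB) = H(E)$ and $H(AE) = H(B)$:

```latex
\begin{align}
I(A;B)_{x} &= H(A)_x + H(B)_x - H(AB)_x = H(A)_x + H(B)_x - H(E)_x, \\
I(A;E)_{x} &= H(A)_x + H(E)_x - H(AE)_x = H(A)_x + H(E)_x - H(B)_x, \\
I(A;B)_{x} + I(A;E)_{x} &= 2 H(A)_x .
\end{align}
% Averaging over x with weights p_X(x) gives
\begin{equation}
I(A;B|X)_\rho + I(A;E|X)_\rho = 2 H(A|X)_\rho
\quad \Longrightarrow \quad
H(A|X)_\rho - \tfrac{1}{2} I(A;B|X)_\rho = \tfrac{1}{2} I(A;E|X)_\rho .
\end{equation}
```

Here the subscript $x$ is shorthand for evaluating each entropy on the conditional state for the letter $x$.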



21.5.5 Trading between Classical and Quantum Communication 

Our final trade-off coding protocol that we consider is that between classical and quantum
communication. The proof of the resource inequality below follows by combining the protocol
in Corollary 21.5.2 with entanglement distribution, in much the same way as we did in
Corollary 21.2.1. Thus, we omit the proof.

Corollary 21.5.3 (CQ Trade-off Coding). The following resource inequality corresponds to
an achievable protocol for simultaneous classical and quantum communication over a noisy
quantum channel $\mathcal{N}$:

\[ \langle \mathcal{N} \rangle \geq I(A \rangle BX)_\rho\,[q \to q] + I(X;B)_\rho\,[c \to c], \tag{21.86} \]

where $\rho^{XAB}$ is a state of the following form:

\[ \rho^{XAB} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \mathcal{N}^{A' \to B}(\phi_x^{AA'}), \tag{21.87} \]

and the states $\phi_x^{AA'}$ are pure.

21.6 Concluding Remarks 

The maintenance of quantum coherence is the theme of this chapter. Alice and Bob can
execute powerful protocols if they perform encoding and decoding in superposition. In both 
entanglement-assisted coherent communication and coherent state transfer, Alice performs 
controlled gates instead of conditional gates and Bob performs coherent measurements that 
place measurement outcomes in an ancilla register without destroying superpositions. Also, 
Bob's final action in both of these protocols is to perform a controlled decoupling unitary, 
ensuring that the state of the environment is independent of Alice and Bob's final state. 
Thus, the same protocol accomplishes the different tasks of entanglement-assisted coher- 
ent communication and coherent state transfer, and these in turn can generate a whole 
host of other protocols by combining them with entanglement distribution and the coherent
and incoherent versions of teleportation and super-dense coding. Among these other
generated protocols are entanglement-assisted quantum communication, quantum communication,
quantum-assisted state transfer, and classical-assisted state transfer. The exercises
in this chapter explore further possibilities if Alice has access to the environments of the
different protocols; the most general version of coherent teleportation arises in such a case.
Trade-off coding is the theme of the last part of this chapter. Here, we are addressing 
the question: Given a fixed amount of a certain resource, how much of another resource can 
Alice and Bob generate? Noisy quantum channels are the most fundamental description of 
a medium over which information can propagate, and it is thus important to understand the 
best ways to make effective use of such a resource for a variety of purposes. We determined 
a protocol that achieves the task of entanglement-assisted communication of classical and 
quantum information, simply by combining the protocols we have already found for classical
communication and entanglement-assisted coherent communication. Chapter 24 continues
this theme of trade-off coding in a much broader context and demonstrates that the protocol
given here, when combined with teleportation, super-dense coding, and entanglement
distribution, is optimal for some channels of interest and essentially optimal in the general
case.



21.7 History and Further Reading 

Devetak et al. showed that it was possible to make the protocols for entanglement-assisted
classical communication and noisy super-dense coding coherent [71, 70], leading to
Theorems 21.1.1 and 21.4.1. They called these protocols the "father" and "mother," respectively,
because they generated many other protocols in quantum Shannon theory by combining
them with entanglement distribution, teleportation, and super-dense coding. Horodecki et
al. [152] formulated a protocol for noisy super-dense coding, but our protocol here makes use
of the coding technique in Ref. [156]. Shor first proved a coding theorem for trading between
assisted and unassisted classical communication [229], and Devetak and Shor followed up on
this result by finding a scheme for trade-off coding between classical and quantum
communication. Some time later, Ref. [159] generalized these two coding schemes to produce the
result of Corollary 21.5.2.
Private Classical Communication



We have now seen in Chapters 19-21 how Alice can communicate classical or quantum
information to Bob, perhaps even with the help of shared entanglement. One might argue that
these communication tasks are the most fundamental tasks in quantum Shannon theory,
given that they have furthered our understanding of the nature of information transmission
over quantum channels. However, when discussing the communication of classical
information, we made no stipulation as to whether this classical information should be public, so
that any third party might have access to it, or private, so that no third party has
access.

This chapter introduces the private classical capacity theorem, which gives the maximum 
rate at which Alice can communicate classical information privately to Bob without anyone 
else in the universe knowing what she sent to him. The information processing task cor- 
responding to this theorem was one of the earliest studied in quantum information theory, 
with the Bennett-Brassard-84 quantum key distribution protocol being the first proposed 
protocol for exploiting quantum mechanics to establish a shared secret key between two 
parties. The private classical capacity theorem is important for quantum key distribution 
because it establishes the maximum rate at which two parties can establish a shared secret 
key. 

Another equally important, but less obvious, utility of private classical communication
is in establishing a protocol for quantum communication at the coherent information rate.
Section 21.2 demonstrated a somewhat roundabout way of arriving at the conclusion that it
is possible to communicate quantum information reliably at the coherent information rate:
recall that we "coherified" the entanglement-assisted classical capacity theorem and then
exploited the coherent communication identity and catalytic use of entanglement.
Establishing achievability of the coherent information rate via private classical coding is another
way of arriving at the same result, with the added benefit that the resulting protocol does
not require the catalytic use of entanglement.

The intuition for quantum communication via privacy arises from the no-cloning theorem. 
Suppose that Alice is able to communicate private classical messages to Bob, so that the 
channel's environment (Eve) is not able to distinguish which message Alice is transmitting 
to Bob. That is, Eve's state is completely independent of Alice's message if the transmitted
message is private. Then we might expect it to be possible to make a coherent version of 
this private classical code by exploiting superpositions of the private classical codewords. 
Since Eve's states are independent of the quantum message that Alice is sending through 
the channel, she is not able to "steal" any of the coherence in Alice's superposed states. 
Given that the overall evolution of the channel to Bob and Eve is unitary and the fact that 
Eve does not receive any quantum information with this scheme, we should expect that the 
quantum information appears at the receiving end of the channel so that Bob can decode 
it. Were Eve able to obtain any information about the private classical messages, then Bob 
would not be able to decode all of the quantum information when they construct a coherent 
version of this private classical code. Otherwise, they would violate the no-cloning theorem. 
We discuss this important application of private classical communication in the next chapter. 
This chapter follows a similar structure as previous chapters. We first detail the information
processing task for private classical communication. Section 22.2 then states the private
classical capacity theorem, with the following two sections proving the achievability part and
the converse part. We end with a general discussion of the private classical capacity and a
brief overview of the secret-key-assisted private classical capacity.

22.1 The Information Processing Task 

We begin by describing the information processing task for private classical communication
(we define an $(n, P - \delta, \epsilon)$ private classical code). Alice selects a message $m$ uniformly at
random from a set $\mathcal{M}$ of messages, and she also selects an index $k$ uniformly at random from
a set $\mathcal{K}$, to assist in randomizing Eve's information (thus, the set $\mathcal{K}$ is a privacy amplification
set).^1 Let $M$ and $K$ denote the random variables corresponding to Alice's choice of $m$
and $k$, respectively. Alice prepares some state $\rho_{m,k}^{A'^n}$ as input to many uses of the quantum
channel $\mathcal{N}^{A' \to B}$. Randomizing over the $K$ variable leads to the following state at the channel
input:

\[ \rho_m^{A'^n} \equiv \frac{1}{|\mathcal{K}|} \sum_{k \in \mathcal{K}} \rho_{m,k}^{A'^n}. \tag{22.1} \]

Alice transmits $\rho_{m,k}^{A'^n}$ over many independent uses of the channel, producing the following
state at Bob's receiving end:

\[ \mathcal{N}^{A'^n \to B^n}(\rho_{m,k}^{A'^n}), \tag{22.2} \]

where $\mathcal{N}^{A'^n \to B^n} \equiv (\mathcal{N}^{A' \to B})^{\otimes n}$.

Bob employs a decoding POVM $\{\Lambda_{m,k}\}$ in order to detect Alice's transmitted message
$m$ and the randomization variable $k$. The probability of error for a particular pair $(m, k)$ is

1 By convention, we usually assume that Alice picks her message $M$ uniformly at random, though this is
not strictly necessary. In the case of the randomizing variable $K$, it is required for Alice to select it uniformly
at random in order to randomize Eve's knowledge of Alice's message $M$. Otherwise, Alice's message $M$
will not be indistinguishable to Eve.

©2012 Mark M. Wilde — This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 
Unported License 






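as follows:

\[ p_e(m,k) = \mathrm{Tr}\{(I - \Lambda_{m,k}) \mathcal{N}^{A'^n \to B^n}(\rho_{m,k}^{A'^n})\}, \tag{22.3} \]

so that the maximal probability of error is

\[ p_e^* \equiv \max_{m \in \mathcal{M},\, k \in \mathcal{K}} p_e(m,k). \tag{22.4} \]

The rate $P$ of this code is

\[ P \equiv \frac{1}{n} \log_2 |\mathcal{M}| + \delta, \tag{22.5} \]

where $\delta$ is some arbitrarily small positive number.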

So far, the above specification of a private classical code is nearly identical to that for the
transmission of classical information outlined in Section 19.2. What distinguishes a private
classical code from a public one is the following extra condition for privacy. Let $U_{\mathcal{N}}^{A' \to BE}$ be
an isometric extension of the channel $\mathcal{N}^{A' \to B}$, so that the complementary channel $(\mathcal{N}^c)^{A' \to E}$ to
the environment Eve is as follows:

\[ (\mathcal{N}^c)^{A' \to E}(\sigma) \equiv \mathrm{Tr}_B\{U_{\mathcal{N}} \sigma U_{\mathcal{N}}^\dagger\}. \tag{22.6} \]

If Alice transmits a message $m$, while selecting the variable $k$ uniformly at random in order
to randomize Eve's knowledge of the message $m$, then the expected state for Eve is as follows:

\[ \omega_m^{E^n} \equiv \frac{1}{|\mathcal{K}|} \sum_{k \in \mathcal{K}} (\mathcal{N}^c)^{A'^n \to E^n}(\rho_{m,k}^{A'^n}). \tag{22.7} \]

Our condition for $\epsilon$-privacy is that Eve's state is always close to a constant state $\omega^{E^n}$, regardless
of which message $m$ Alice transmits through the channel:

\[ \forall m \in \mathcal{M} : \quad \| \omega_m^{E^n} - \omega^{E^n} \|_1 \leq \epsilon. \tag{22.8} \]

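This definition is the strongest definition of privacy because it implies that Eve cannot learn
anything about the message $m$ that Alice transmits through the channel. Let $\sigma^{E^n}$ denote
Eve's state averaged over all possible messages:

\[ \sigma^{E^n} \equiv \frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}} \omega_m^{E^n}. \tag{22.9} \]

It follows that

\[ \| \sigma^{E^n} - \omega^{E^n} \|_1 \leq \epsilon, \tag{22.10} \]

by applying convexity of the trace distance to (22.8). The criterion in (22.8) implies that
Eve's Holevo information with $M$ is arbitrarily small:

\[ I(M; E^n) = H(E^n) - H(E^n|M) \tag{22.11} \]
\[ = H(\sigma^{E^n}) - \frac{1}{|\mathcal{M}|} \sum_m H(\omega_m^{E^n}) \tag{22.12} \]
\[ \leq H(\omega^{E^n}) - \frac{1}{|\mathcal{M}|} \sum_m H(\omega^{E^n}) + 4 \epsilon n \log d_E + 4 H_2(\epsilon) \tag{22.13} \]
\[ = 4 \epsilon n \log d_E + 4 H_2(\epsilon). \tag{22.14} \]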
(22.14) 
Figure 22.1: The information processing task for private classical communication. Alice encodes some
private message $m$ into a quantum codeword $\rho_m^{A'^n}$ and transmits it over many uses of a quantum channel.
The goal of such a protocol is for Bob to be able to reliably distinguish the message, while the channel's
environment Eve should not be able to learn anything about it.



The inequality follows from applying Fannes' inequality (Theorem 11.9.5) to both entropies.
Thus, if $\epsilon$ is exponentially small in $n$ (which will be the case for our codes), then it is possible
to make Eve's information about the message become arbitrarily small in the asymptotic
limit. Figure 22.1 depicts the information processing task for private classical communication.
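To see why an exponentially small $\epsilon$ matters, it can help to evaluate the bound $4 \epsilon n \log d_E + 4 H_2(\epsilon)$ numerically. The sketch below is illustrative only: it assumes base-2 logarithms, an environment dimension $d_E = 2$, and the hypothetical decay $\epsilon = 2^{-n/10}$; none of these particular values come from the text.

```python
import math


def binary_entropy(eps):
    """Binary entropy H_2(eps) in bits, with H_2(0) = H_2(1) = 0."""
    if eps <= 0.0 or eps >= 1.0:
        return 0.0
    return -eps * math.log2(eps) - (1.0 - eps) * math.log2(1.0 - eps)


def eve_information_bound(eps, n, d_E):
    """Evaluate 4 * eps * n * log2(d_E) + 4 * H_2(eps)."""
    return 4.0 * eps * n * math.log2(d_E) + 4.0 * binary_entropy(eps)


# Hypothetical exponentially decaying privacy parameter: the bound shrinks with n.
decaying = [eve_information_bound(2.0 ** (-0.1 * n), n, d_E=2) for n in (50, 100, 200)]

# With a fixed eps, the first term grows linearly in n instead.
fixed = [eve_information_bound(0.01, n, d_E=2) for n in (50, 100, 200)]
```

The contrast between the two lists is the point: the factor of $n$ in the first term means a constant $\epsilon$ is not enough, whereas any exponential decay of $\epsilon$ in $n$ drives the whole bound to zero.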

In summary, we say that a rate $P$ of private classical communication is achievable if there
exists an $(n, P - \delta, \epsilon)$ private classical code for all $\epsilon, \delta > 0$ and sufficiently large $n$, where $\epsilon$
characterizes both the reliability and the privacy of the code.



22.2 The Private Classical Capacity Theorem 

We now state the main theorem of this chapter, the private classical capacity theorem. 

Theorem 22.2.1 (Devetak-Cai-Winter-Yeung). The private classical capacity of a quantum
channel $\mathcal{N}^{A' \to B}$ is the supremum over all achievable rates for private classical communication,
and one characterization of it is the regularization of the private information of the
channel:

\[ \sup\{P \mid P \text{ is achievable}\} = P_{\mathrm{reg}}(\mathcal{N}), \tag{22.15} \]

where

\[ P_{\mathrm{reg}}(\mathcal{N}) \equiv \lim_{k \to \infty} \frac{1}{k} P(\mathcal{N}^{\otimes k}). \tag{22.16} \]

The private information $P(\mathcal{N})$ is defined as

\[ P(\mathcal{N}) \equiv \max_\rho \left[ I(X;B)_\sigma - I(X;E)_\sigma \right], \tag{22.17} \]

where $\rho$ is a classical-quantum state of the following form:

\[ \rho^{XA'} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \rho_x^{A'}, \tag{22.18} \]

and $\sigma^{XBE} \equiv U_{\mathcal{N}}^{A' \to BE}(\rho^{XA'})$, with $U_{\mathcal{N}}^{A' \to BE}$ an isometric extension of the channel $\mathcal{N}^{A' \to B}$.



We first prove the achievability part of the coding theorem and follow with the converse
proof. Recall that the private information is additive whenever the channel is degradable
(Theorem 12.6.3). Thus, for this class of channels, the regularization in (22.16) is not
necessary and the private information of the channel is equal to the private classical capacity
(in fact, the results from Theorem 12.6.2 and the next chapter demonstrate that the private
information of a degradable channel is also equal to its quantum capacity). Unfortunately,
it is known in the general case that the regularization of the private information is necessary
in order to characterize the private capacity because there is an example of a channel for
which the private information is superadditive.
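To get a feel for the objective $I(X;B) - I(X;E)$ in (22.17), one can evaluate it in a purely classical toy setting, where the Holevo informations reduce to Shannon mutual informations. The sketch below assumes, for illustration only, that Bob and Eve each observe Alice's bit through their own binary symmetric channel; this is a classical stand-in rather than an isometric extension of a quantum channel, and it evaluates the objective for one fixed input distribution instead of performing the maximization. All function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Binary entropy in bits, with the convention 0 * log 0 = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def mutual_information(p_x, channel):
    """Shannon mutual information I(X;Y) in bits for channel[x][y] = p(y|x)."""
    n_out = len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x))) for y in range(n_out)]
    total = 0.0
    for x, px in enumerate(p_x):
        for y in range(n_out):
            joint = px * channel[x][y]
            if joint > 0.0:
                total += joint * math.log2(channel[x][y] / p_y[y])
    return total


def bsc(flip):
    """Transition matrix of a binary symmetric channel with flip probability `flip`."""
    return [[1.0 - flip, flip], [flip, 1.0 - flip]]


def private_information_objective(p_x, to_bob, to_eve):
    """I(X;B) - I(X;E) for a fixed input distribution (no maximization)."""
    return mutual_information(p_x, to_bob) - mutual_information(p_x, to_eve)


# Bob sees a less noisy channel than Eve, so the difference is positive.
value = private_information_objective([0.5, 0.5], bsc(0.1), bsc(0.25))
```

For a uniform input, each mutual information equals one minus the binary entropy of the flip probability, so the difference is just the gap between Eve's and Bob's binary entropies.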



22.3 The Direct Coding Theorem 



This section gives the proof that the private information in (22.17) is an achievable rate for
private classical communication over a quantum channel $\mathcal{N}^{A' \to B}$. We first give the intuition
behind the protocol. Alice's goal is to build a doubly-indexed codebook $\{x^n(m,k)\}_{m \in \mathcal{M}, k \in \mathcal{K}}$
that satisfies two properties:

1. Bob should be able to detect the message $m$ and the "junk" variable $k$ with high
probability. From the classical coding theorem of Chapter 19, our intuition is that he
should be able to do so as long as $|\mathcal{M}||\mathcal{K}| \approx 2^{nI(X;B)}$.

2. Randomizing over the "junk" variable $k$ should approximately cover the typical
subspace of Eve's system, so that every state of Eve depending on the message $m$ looks
like a constant, independent of the message $m$ Alice sends (we would like the code to
satisfy (22.8)). Our intuition from the Covering Lemma (Chapter 16) is that the size
of the "junk" variable set $\mathcal{K}$ needs to be at least $|\mathcal{K}| \approx 2^{nI(X;E)}$ in order for Alice to
approximately cover Eve's typical subspace.

Our method for generating a code is again random because we can invoke
the typicality properties that hold in the asymptotic limit of many channel uses. Thus, if
Alice chooses a code that satisfies the above criteria, she can send approximately
$|\mathcal{M}| \approx 2^{n[I(X;B) - I(X;E)]}$ distinguishable signals to Bob such that they are indistinguishable to Eve.
We devote the remainder of this section to proving that the above intuition is correct.
Figure 22.2 displays the anatomy of a private classical code.
[Figure 22.2 labels: Set of Classical Codewords; Bob's Typical Subspace; Eve's Typical Subspace]

Figure 22.2: The anatomy of a code for private classical communication. In this illustrative example,
Alice has eight codewords, with each depicted as a • and indexed by $m \in \{1,2\}$ and $k \in \{1,2,3,4\}$. Thus,
she is interested in sending one of two messages and has the "junk" variable $k$ available for randomizing
Eve's state. Each classical codeword $x^n(m,k)$ maps to a distinguishable subspace on Bob's typical subspace
(we show two of the mappings in the figure, while displaying eight distinguishable subspaces). From the
Packing Lemma, our intuition is that Alice can reliably send about $2^{nI(X;B)}$ distinguishable signals. The
codewords $\{x^n(1,k)\}_{k \in \{1,2,3,4\}}$ and $\{x^n(2,k)\}_{k \in \{1,2,3,4\}}$ are each grouped in a box to indicate that they
form a privacy amplification set. When randomizing $k$, the codewords $\{x^n(1,k)\}_{k \in \{1,2,3,4\}}$ uniformly cover
Eve's typical subspace (and so does the set $\{x^n(2,k)\}_{k \in \{1,2,3,4\}}$), so that it becomes nearly impossible in
the asymptotic limit for Eve to distinguish whether Alice is sending a codeword in $\{x^n(1,k)\}_{k \in \{1,2,3,4\}}$ or
$\{x^n(2,k)\}_{k \in \{1,2,3,4\}}$. In this way, Eve cannot determine which message Alice is transmitting. The minimum
size for each privacy amplification set in the asymptotic limit is $\approx 2^{nI(X;E)}$.
Party                     Quantity               Typical Set/Subspace       Projector
Alice                     $X$                    $T_\delta^{X^n}$           N/A
Bob                       $\rho^{B^n}$           $T_\delta^{B^n}$           $\Pi_\delta^{B^n}$
Bob conditioned on $x^n$  $\rho_{x^n}^{B^n}$     $T_\delta^{B^n|x^n}$       $\Pi_\delta^{B^n|x^n}$
Eve                       $\rho^{E^n}$           $T_\delta^{E^n}$           $\Pi_\delta^{E^n}$
Eve conditioned on $x^n$  $\rho_{x^n}^{E^n}$     $T_\delta^{E^n|x^n}$       $\Pi_\delta^{E^n|x^n}$

Table 22.1: The above table lists several mathematical quantities involved in the construction of a random
private code. The first column lists the party to whom the quantities belong. The second column lists the
random classical or quantum states. The third column gives the appropriate typical set or subspace. The
final column lists the appropriate projector onto the typical subspace for the quantum states.



22.3.1 Dimensionality Arguments 

Before giving the proof of achievability, we confirm the above intuition with some
dimensionality arguments and show how to satisfy the conditions of both the Packing and Covering
Lemmas. Suppose that Alice has some ensemble $\{p_X(x), \rho_x^{A'}\}$ from which she can generate
random codes. Let $U_{\mathcal{N}}^{A' \to BE}$ denote the isometric extension of the channel $\mathcal{N}^{A' \to B}$, and let
$\rho_x^{BE}$ denote the joint state of Bob and Eve after Alice inputs $\rho_x^{A'}$:

\[ \rho_x^{BE} \equiv U_{\mathcal{N}} \rho_x^{A'} U_{\mathcal{N}}^\dagger. \tag{22.19} \]

The local respective density operators for Bob and Eve given a message $x$ are as follows:

\[ \rho_x^B \equiv \mathrm{Tr}_E\{\rho_x^{BE}\}, \qquad \rho_x^E \equiv \mathrm{Tr}_B\{\rho_x^{BE}\}. \tag{22.20} \]

The expected respective density operators for Bob and Eve are as follows:

\[ \rho^B \equiv \sum_x p_X(x) \rho_x^B, \qquad \rho^E \equiv \sum_x p_X(x) \rho_x^E. \tag{22.21} \]

Given a particular input sequence $x^n$, we define the $n$th extensions of the above states as
follows:

\[ \rho_{x^n}^{B^n} \equiv \mathrm{Tr}_{E^n}\{\rho_{x^n}^{B^n E^n}\}, \tag{22.22} \]
\[ \rho_{x^n}^{E^n} \equiv \mathrm{Tr}_{B^n}\{\rho_{x^n}^{B^n E^n}\}, \tag{22.23} \]
\[ \rho^{B^n} \equiv \sum_{x^n \in \mathcal{X}^n} p_{X^n}(x^n) \rho_{x^n}^{B^n}, \tag{22.24} \]
\[ \rho^{E^n} \equiv \sum_{x^n \in \mathcal{X}^n} p_{X^n}(x^n) \rho_{x^n}^{E^n}. \tag{22.25} \]



The following four conditions corresponding to the Packing Lemma hold for Bob's states
$\{\rho_{x^n}^{B^n}\}$, Bob's average density operator $\rho^{B^n}$, Bob's typical subspace
$T_\delta^{B^n}$, and Bob's conditionally typical subspace $T_\delta^{B^n|x^n}$:

$$\mathrm{Tr}\{\rho_{x^n}^{B^n}\Pi_\delta^{B^n}\} \ge 1-\varepsilon, \tag{22.26}$$
$$\mathrm{Tr}\{\rho_{x^n}^{B^n}\Pi_\delta^{B^n|x^n}\} \ge 1-\varepsilon, \tag{22.27}$$
$$\mathrm{Tr}\{\Pi_\delta^{B^n|x^n}\} \le 2^{n(H(B|X)+c\delta)}, \tag{22.28}$$
$$\Pi_\delta^{B^n}\rho^{B^n}\Pi_\delta^{B^n} \le 2^{-n(H(B)-c\delta)}\Pi_\delta^{B^n}, \tag{22.29}$$

where $c$ is some positive constant (see Properties 14.2.7, 14.1.3, 14.1.2, and 14.1.1).

©2012 Mark M. Wilde. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0
Unported License

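The following four conditions corresponding to the Covering Lemma hold for Eve's states
$\{\rho_{x^n}^{E^n}\}$, Eve's typical subspace $T_\delta^{E^n}$, and Eve's conditionally typical
subspace $T_\delta^{E^n|x^n}$:

$$\mathrm{Tr}\{\rho_{x^n}^{E^n}\Pi_\delta^{E^n}\} \ge 1-\varepsilon, \tag{22.30}$$
$$\mathrm{Tr}\{\rho_{x^n}^{E^n}\Pi_\delta^{E^n|x^n}\} \ge 1-\varepsilon, \tag{22.31}$$
$$\mathrm{Tr}\{\Pi_\delta^{E^n}\} \le 2^{n(H(E)+c\delta)}, \tag{22.32}$$
$$\Pi_\delta^{E^n|x^n}\rho_{x^n}^{E^n}\Pi_\delta^{E^n|x^n} \le 2^{-n(H(E|X)-c\delta)}\Pi_\delta^{E^n|x^n}. \tag{22.33}$$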



The above properties suggest that we can use the methods of both the Packing Lemma
and the Covering Lemma for constructing a private code. Consider two sets $\mathcal{M}$ and
$\mathcal{K}$ with the following respective sizes:

$$|\mathcal{M}| = 2^{n[I(X;B)-I(X;E)-6c\delta]}, \tag{22.34}$$
$$|\mathcal{K}| = 2^{n[I(X;E)+3c\delta]}, \tag{22.35}$$

so that the product set $\mathcal{M}\times\mathcal{K}$ indexed by the ordered pairs $(m,k)$ is of size

$$|\mathcal{M}\times\mathcal{K}| = |\mathcal{M}||\mathcal{K}| = 2^{n[I(X;B)-3c\delta]}. \tag{22.36}$$

The sizes of these sets suggest that we can use the product set $\mathcal{M}\times\mathcal{K}$ for
sending classical information, while we use $|\mathcal{M}|$ "privacy amplification" sets, each of
size $|\mathcal{K}|$, for reducing Eve's knowledge of the message $m$ (see Figure 22.2).
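The bookkeeping behind (22.34)-(22.36) can be checked numerically. The following is a minimal sketch, not part of the proof: the function name and the numerical values of $I(X;B)$, $I(X;E)$, $c$, and $\delta$ are hypothetical and chosen only for illustration.

```python
def code_set_exponents(i_xb, i_xe, c, delta):
    """Per-channel-use exponents (log2 of the set size, divided by n)."""
    rate_m = i_xb - i_xe - 6 * c * delta  # message set M, as in (22.34)
    rate_k = i_xe + 3 * c * delta         # randomization set K, as in (22.35)
    rate_mk = rate_m + rate_k             # product set M x K, as in (22.36)
    return rate_m, rate_k, rate_mk


# Hypothetical numbers chosen only for illustration.
rate_m, rate_k, rate_mk = code_set_exponents(i_xb=0.8, i_xe=0.3, c=1.0, delta=0.01)
print(rate_m, rate_k, rate_mk)
```

The product-set exponent stays $3c\delta$ below $I(X;B)$, which leaves room for the Packing Lemma, while the exponent of $\mathcal{K}$ sits $3c\delta$ above $I(X;E)$, which leaves room for the Covering Lemma.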



22.3.2 Random Code Construction 

We now argue for the existence of a good private classical code with rate $P$ approaching
$I(X;B) - I(X;E)$ if Alice selects it randomly according to the ensemble

$$\{p'_{X'^n}(x^n), \rho_{x^n}^{A'^n}\}, \tag{22.37}$$

where $p'_{X'^n}(x^n)$ is the pruned distribution (see Section 19.3.1; recall that this distribution is
close to the IID distribution). Let us choose $|\mathcal{M}||\mathcal{K}|$ random variables $X^n(m,k)$
according to the distribution $p'_{X'^n}(x^n)$, where the realizations of the random variables
$X^n(m,k)$ take values in $\mathcal{X}^n$. After selecting these codewords randomly, the code
$\mathcal{C} = \{x^n(m,k)\}_{m\in\mathcal{M},k\in\mathcal{K}}$ is then a fixed set of codewords $x^n(m,k)$
depending on the message $m$ and the randomization variable $k$.

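We first consider how well Bob can distinguish the pair $(m,k)$ and argue that the random
code is a good code in the sense that the expectation of the average error probability over
all codes is low. The Packing Lemma is the basis of our argument. By applying the Packing
Lemma (Lemma 15.3.1) to (22.26)-(22.29), there exists a POVM
$(\Lambda_{m,k})_{(m,k)\in\mathcal{M}\times\mathcal{K}}$ corresponding to the random choice of code that
reliably distinguishes the states $\{\rho^{B^n}_{X^n(m,k)}\}_{m\in\mathcal{M},k\in\mathcal{K}}$ in
the following sense:

$$\mathbb{E}_{\mathcal{C}}\{\bar{p}_e(\mathcal{C})\} = 1 - \mathbb{E}_{\mathcal{C}}\left\{\frac{1}{|\mathcal{M}||\mathcal{K}|}\sum_{m\in\mathcal{M}}\sum_{k\in\mathcal{K}}\mathrm{Tr}\{\rho^{B^n}_{X^n(m,k)}\Lambda_{m,k}\}\right\} \tag{22.38}$$
$$\le 2(\varepsilon+\sqrt{8\varepsilon}) + 4\left(\frac{2^{n(H(B)-c\delta)}}{2^{n(H(B|X)+c\delta)}|\mathcal{M}\times\mathcal{K}|}\right)^{-1} \tag{22.39}$$
$$= 2(\varepsilon+\sqrt{8\varepsilon}) + 4\left(\frac{2^{n(H(B)-c\delta)}}{2^{n(H(B|X)+c\delta)}2^{n[I(X;B)-3c\delta]}}\right)^{-1} \tag{22.40}$$
$$= 2(\varepsilon+\sqrt{8\varepsilon}) + 4\cdot 2^{-nc\delta} \equiv \varepsilon', \tag{22.41}$$

where the first equality follows by definition, the first inequality follows by application of the
Packing Lemma to the conditions in (22.26)-(22.29), the second equality follows by substitution
of (22.36), and the last equality follows by a straightforward calculation. We can make $\varepsilon'$
arbitrarily small by choosing $n$ large enough.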
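To see how the bound in (22.41) behaves, one can simply evaluate it. The sketch below is illustrative only: the function name and the parameter values are assumptions, and the formula is the expression for $\varepsilon'$ in (22.41).

```python
import math


def packing_error_bound(eps, n, c, delta):
    """Evaluate eps' = 2(eps + sqrt(8 eps)) + 4 * 2^(-n c delta), as in (22.41)."""
    return 2 * (eps + math.sqrt(8 * eps)) + 4 * 2.0 ** (-n * c * delta)


# Hypothetical parameters: the second term decays exponentially in n.
for n in (10, 100, 1000, 10000):
    print(n, packing_error_bound(eps=1e-4, n=n, c=1.0, delta=0.01))
```

For fixed $\varepsilon$ the bound saturates at $2(\varepsilon+\sqrt{8\varepsilon})$ as $n$ grows, so both a small $\varepsilon$ and a large $n$ are needed to make $\varepsilon'$ small.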

Let us now consider the corresponding density operators $\rho^{E^n}_{X^n(m,k)}$ for Eve. Consider
dividing the random code $\mathcal{C}$ into $|\mathcal{M}|$ privacy amplification sets, each of size
$|\mathcal{K}|$. Each privacy amplification set
$\mathcal{C}_m \equiv \{\rho^{E^n}_{X^n(m,k)}\}_{k\in\mathcal{K}}$ of density operators forms a good covering
code according to the Covering Lemma (Lemma 16.2.1). The fake density operator of each
privacy amplification set $\mathcal{C}_m$ is as follows:

$$\hat{\rho}_m^{E^n} \equiv \frac{1}{|\mathcal{K}|}\sum_{k\in\mathcal{K}}\rho^{E^n}_{X^n(m,k)}, \tag{22.42}$$

because Alice chooses the randomizing variable $k$ uniformly at random. The obfuscation
error $o_e(\mathcal{C}_m)$ of each privacy amplification set $\mathcal{C}_m$ is as follows:

$$o_e(\mathcal{C}_m) = \left\|\hat{\rho}_m^{E^n} - \rho^{E^n}\right\|_1, \tag{22.43}$$



where $\rho^{E^n}$ is defined in (22.25). The Covering Lemma (Lemma 16.2.1) states that the
obfuscation error for each random privacy amplification set $\mathcal{C}_m$ has a high probability of
being small if $n$ is sufficiently large and $|\mathcal{K}|$ is chosen as in (22.35):

$$\Pr\{o_e(\mathcal{C}_m) \le \varepsilon + 4\sqrt{\varepsilon} + 24\sqrt[4]{\varepsilon}\} \ge 1 - 2d_E^n\exp\left(\frac{-\varepsilon^3|\mathcal{K}|\,2^{n(H(E|X)-c\delta)}}{4\ln 2\;2^{n(H(E)+c\delta)}}\right) \tag{22.44}$$
$$= 1 - 2d_E^n\exp\left(\frac{-\varepsilon^3\,2^{n[I(X;E)+3c\delta]}\,2^{n(H(E|X)-c\delta)}}{4\ln 2\;2^{n(H(E)+c\delta)}}\right) \tag{22.45}$$
$$= 1 - 2d_E^n\exp\left(\frac{-\varepsilon^3\,2^{nc\delta}}{4\ln 2}\right). \tag{22.46}$$

In particular, let us choose $n$ large enough so that the following bound holds:

$$\Pr\{o_e(\mathcal{C}_m) \le \varepsilon + 4\sqrt{\varepsilon} + 24\sqrt[4]{\varepsilon}\} \ge 1 - \frac{\varepsilon}{|\mathcal{M}|}. \tag{22.47}$$



That we can do so follows from the important fact that $\exp\{-\varepsilon^3 2^{nc\delta}/(4\ln 2)\}$ is doubly
exponentially decreasing in $n$. (We also see here why it is absolutely necessary to have the
"wiggle room" given by an arbitrarily small, yet strictly positive $\delta$.)

This random construction already has some of the desirable features that we are looking 
for in a private code just by choosing n to be sufficiently large. The expectation of Bob's 
average error probability for detecting the pair m, k is small, and the obfuscation error of 
each privacy amplification set has a high probability of being small. Our hope is that there 
exists some code for which Bob can retrieve the message m with the guarantee that Eve's 
knowledge is independent of this message m. We argue in the next two sections that such a 
good private code exists. 
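The claim that (22.47) eventually holds can be explored numerically by comparing the failure term of (22.46) with $\varepsilon/|\mathcal{M}|$ in the log domain. This is only a sketch under assumed parameters: the helper names, the value of $d_E$, and the rate of $\mathcal{M}$ are hypothetical, and the scan is limited to a range of $n$ where $2^{nc\delta}$ still fits in a float.

```python
import math


def covering_failure_log(n, eps, d_e, c, delta):
    """Natural log of 2 * d_E^n * exp(-eps^3 * 2^(n c delta) / (4 ln 2))."""
    return math.log(2) + n * math.log(d_e) - eps**3 * 2.0 ** (n * c * delta) / (4 * math.log(2))


def target_log(n, eps, rate_m):
    """Natural log of eps / |M|, with |M| = 2^(n * rate_m)."""
    return math.log(eps) - n * rate_m * math.log(2)


def smallest_sufficient_n(eps, d_e, c, delta, rate_m, max_n=5000):
    """Linear scan for the first n where the failure bound drops below eps / |M|."""
    for n in range(1, max_n + 1):
        if covering_failure_log(n, eps, d_e, c, delta) <= target_log(n, eps, rate_m):
            return n
    return None  # not reached within max_n


# Hypothetical parameters for illustration only.
params = dict(eps=0.1, d_e=2, c=1.0, delta=0.05, rate_m=0.44)
n_star = smallest_sufficient_n(**params)
print(n_star)
```

The terms $n\ln d_E$ and $n\log_2|\mathcal{M}|$ grow only linearly in $n$, whereas the exponent $\varepsilon^3 2^{nc\delta}/(4\ln 2)$ grows exponentially, so a sufficient $n$ always exists when $\delta > 0$.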



22.3.3 Derandomization 

We now apply a derandomization argument similar to the one that is needed in the proof 
of the HSW coding theorem. The argument in this case is more subtle because we would 
like to find a code that has good classical communication with the guarantee that it also has 
good privacy. We need to determine the probability over all codes that there exists a good 
private code. If this probability is non-zero, then we are sure that a good private code exists. 
As we have said at the beginning of this section, a good private code has two qualities:
the code is $\varepsilon$-good for classical communication and it is $\varepsilon$-private as well. Let $E_0$
denote the event that the random code $\mathcal{C}$ is $\varepsilon$-good for classical communication:

$$E_0 = \{\bar{p}_e(\mathcal{C}) \le \varepsilon\}, \tag{22.48}$$

where we restrict the performance criterion to the average probability of error for now. Let
$E_m$ denote the event that the $m$th message in the random code is $\varepsilon$-private:

$$E_m = \{o_e(\mathcal{C}_m) \le \varepsilon\}. \tag{22.49}$$

We would like all of the above events to be true, or, equivalently, we would like the intersection
of the above events to occur:

$$E_{\mathrm{priv}} = E_0 \cap \bigcap_{m\in\mathcal{M}} E_m. \tag{22.50}$$

If there is a positive probability over all codes that the above event is true, then there exists
a particular code that satisfies the above conditions. Let us instead consider the complement
of the above event (the event that the random code is not a good private code):

$$E_{\mathrm{priv}}^c = E_0^c \cup \bigcup_{m\in\mathcal{M}} E_m^c. \tag{22.51}$$

We can then exploit the union bound from probability theory to bound the probability of
the complementary event $E_{\mathrm{priv}}^c$ as follows:

$$\Pr\left\{E_0^c \cup \bigcup_{m\in\mathcal{M}} E_m^c\right\} \le \Pr\{E_0^c\} + \sum_{m\in\mathcal{M}}\Pr\{E_m^c\}. \tag{22.52}$$

So if we can make the probability of the event $E_{\mathrm{priv}}^c$ small, then the probability of the event
$E_{\mathrm{priv}}$ that the random code is a good private code is high.

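Let us first bound the probability of the event $E_0^c$. Markov's inequality states that the
following holds for a non-negative random variable $Y$:

$$\Pr\{Y \ge \alpha\} \le \frac{\mathbb{E}\{Y\}}{\alpha}. \tag{22.53}$$

We can apply Markov's inequality because the random average error probability $\bar{p}_e(\mathcal{C})$ is
always non-negative:

$$\Pr\{E_0^c\} = \Pr\{\bar{p}_e(\mathcal{C}) \ge (\varepsilon')^{3/4}\} \le \frac{\mathbb{E}_{\mathcal{C}}\{\bar{p}_e(\mathcal{C})\}}{(\varepsilon')^{3/4}} \le \frac{\varepsilon'}{(\varepsilon')^{3/4}} = \sqrt[4]{\varepsilon'}. \tag{22.54}$$

So we now have a good bound on the probability of the complementary event $E_0^c$.

Let us now bound the probability of the events $E_m^c$. The bounds in the previous section
already give us what we need:

$$\Pr\{E_m^c\} = \Pr\{o_e(\mathcal{C}_m) > \varepsilon + 4\sqrt{\varepsilon} + 24\sqrt[4]{\varepsilon}\} \tag{22.55}$$
$$< \frac{\varepsilon}{|\mathcal{M}|}, \tag{22.56}$$

implying that

$$\sum_{m\in\mathcal{M}}\Pr\{E_m^c\} < |\mathcal{M}|\frac{\varepsilon}{|\mathcal{M}|} = \varepsilon. \tag{22.57}$$

So it now follows that the probability of the complementary event is small:

$$\Pr\{E_{\mathrm{priv}}^c\} \le \sqrt[4]{\varepsilon'} + \varepsilon, \tag{22.58}$$

and there is a high probability that there is a good code:

$$\Pr\{E_{\mathrm{priv}}\} \ge 1 - \left(\sqrt[4]{\varepsilon'} + \varepsilon\right). \tag{22.59}$$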

Thus, there exists a particular code $\mathcal{C}$ such that its average probability of error is small
for decoding the classical information:

$$\bar{p}_e(\mathcal{C}) \le (\varepsilon')^{3/4}, \tag{22.60}$$

and the obfuscation error of each privacy amplification set is small:

$$\forall m: \quad o_e(\mathcal{C}_m) \le \varepsilon + 4\sqrt{\varepsilon} + 24\sqrt[4]{\varepsilon}. \tag{22.61}$$

The derandomized code $\mathcal{C}$ is as follows:

$$\mathcal{C} \equiv \{x^n(m,k)\}_{m\in\mathcal{M},k\in\mathcal{K}}, \tag{22.62}$$

so that each codeword $x^n(m,k)$ is a deterministic variable. Each privacy amplification set
for the derandomized code is as follows:

$$\mathcal{C}_m \equiv \{x^n(m,k)\}_{k\in\mathcal{K}}. \tag{22.63}$$



The result in (22.59) is perhaps astonishing in hindsight. By choosing a private code in 
a random way and choosing the block length n of the private code to be sufficiently large, 
the overwhelming majority of codes constructed in this fashion are good private codes! 
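The union-bound arithmetic behind (22.59) is short enough to evaluate directly. The following sketch uses a hypothetical function name and hypothetical values of $\varepsilon$ and $\varepsilon'$; it only combines the Markov term from (22.54) with the summed privacy terms from (22.57).

```python
def good_code_probability_lower_bound(eps, eps_prime):
    """Lower bound 1 - (eps'^(1/4) + eps) on Pr{E_priv}, as in (22.59)."""
    markov_term = eps_prime / eps_prime ** 0.75  # Markov bound from (22.54)
    union_term = eps                             # summed privacy terms from (22.57)
    return 1 - (markov_term + union_term)


# Hypothetical values: eps' = 1e-4 gives a Markov term of about 0.1.
print(good_code_probability_lower_bound(eps=0.01, eps_prime=1e-4))
```

Note that the fourth root makes the bound approach one slowly: $\varepsilon'$ must be quite small before the lower bound is close to 1.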

22.3.4 Expurgation 

We would like to strengthen the above result even more, so that the code has a low maximal 
probability of error, not just a low average error probability. We expurgate codewords from 
the code as before, but we have to be careful with the expurgation argument because we 
need to make sure that the code still has good privacy after expurgation. 



We can apply Markov's inequality for the expurgation in a way similar to that in Exercise 2.2.1.
It is possible to apply Markov's inequality to the bound on the average error probability in
(22.54) to show that at most a fraction $\sqrt{\varepsilon'}$ of the codewords have error probability greater
than $\sqrt[4]{\varepsilon'}$. We could merely expurgate the worst fraction $\sqrt{\varepsilon'}$ of the codewords from the
private code. But expurgating in this fashion does not guarantee that each privacy amplification set
has the same number of codewords. Therefore, we expurgate the worst fraction $\sqrt{\varepsilon'}$ of the
codewords in each privacy amplification set. We then expurgate the worst fraction $\sqrt{\varepsilon'}$ of the
privacy amplification sets. The expurgated sets $\mathcal{M}'$ and $\mathcal{K}'$ both become a fraction
$1-\sqrt{\varepsilon'}$ of their original size. We denote the expurgated code as follows:

$$\mathcal{C}' \equiv \{x^n(m,k)\}_{m\in\mathcal{M}',k\in\mathcal{K}'}, \tag{22.64}$$

and the expurgated code has the following privacy amplification sets:

$$\mathcal{C}'_m \equiv \{x^n(m,k)\}_{k\in\mathcal{K}'}. \tag{22.65}$$

The expurgation has a negligible impact on the rate of the private code when $n$ is large.

Does each privacy amplification set still have good privacy properties after performing
the above expurgation? The fake density operator for each expurgated privacy amplification
set is as follows:

$$\hat{\rho}_m^{\prime E^n} \equiv \frac{1}{|\mathcal{K}'|}\sum_{k\in\mathcal{K}'}\rho^{E^n}_{x^n(m,k)}. \tag{22.66}$$

It is possible to show that the fake density operators in the derandomized code are
$2\sqrt{\varepsilon'}$-close in trace distance to the fake density operators in the expurgated code:

$$\forall m\in\mathcal{M}': \quad \left\|\hat{\rho}_m^{E^n} - \hat{\rho}_m^{\prime E^n}\right\|_1 \le 2\sqrt{\varepsilon'}, \tag{22.67}$$

because these operators only lose a small fraction of their mass after expurgation.
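The two-stage expurgation described above can be sketched on a table of per-codeword error probabilities. This is an illustration under stated assumptions, not the author's procedure: the function names are hypothetical, the error values are made up, and ranking the privacy amplification sets by their largest surviving error is an assumption, since the text does not fix that criterion.

```python
import math


def expurgate(errors, fraction):
    """Two-stage expurgation of a table errors[m][k] of codeword error probabilities.

    Stage 1 drops the worst `fraction` of codewords inside every privacy
    amplification set; stage 2 drops the worst `fraction` of the sets.
    Ranking sets by their largest surviving error is an assumption of this sketch.
    """
    num_sets, set_size = len(errors), len(errors[0])
    keep_k = set_size - math.ceil(fraction * set_size)
    keep_m = num_sets - math.ceil(fraction * num_sets)

    # Stage 1: keep the keep_k most reliable codewords of each set.
    trimmed = {}
    for m, row in enumerate(errors):
        best = sorted(range(set_size), key=lambda k: row[k])[:keep_k]
        trimmed[m] = sorted(best)

    # Stage 2: keep the keep_m sets whose worst surviving codeword is best.
    def worst(m):
        return max(errors[m][k] for k in trimmed[m])

    kept_sets = sorted(sorted(trimmed, key=worst)[:keep_m])
    return {m: trimmed[m] for m in kept_sets}


def rate_loss(fraction, n):
    """Rate reduction -log2(1 - fraction) / n caused by shrinking the message set."""
    return -math.log2(1 - fraction) / n


# Made-up error probabilities for 4 sets of 4 codewords each.
errors = [
    [0.10, 0.90, 0.20, 0.30],
    [0.50, 0.60, 0.70, 0.80],
    [0.05, 0.01, 0.02, 0.99],
    [0.20, 0.20, 0.20, 0.20],
]
code = expurgate(errors, fraction=0.25)
print(code)
```

Every surviving set keeps the same number of codewords, which is the property that a single global expurgation would not guarantee, and `rate_loss` shows that the cost in rate vanishes as $1/n$.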

We now drop the primed notation for the expurgated code. It follows that the expurgated
code $\mathcal{C}$ has good privacy:

$$\forall m\in\mathcal{M}: \quad \left\|\hat{\rho}_m^{E^n} - \rho^{E^n}\right\|_1 \le \varepsilon + 4\sqrt{\varepsilon} + 24\sqrt[4]{\varepsilon} + 2\sqrt{\varepsilon'}, \tag{22.68}$$

and reliable communication:

$$\forall m\in\mathcal{M}, k\in\mathcal{K}: \quad p_e(\mathcal{C},m,k) \le \sqrt[4]{\varepsilon'}. \tag{22.69}$$

The first expression follows by application of the triangle inequality to (22.61) and (22.67).
We end the proof by summarizing the operation of the private code. Alice chooses the
message $m$ and the randomization variable $k$ uniformly at random from the respective sets
$\mathcal{M}$ and $\mathcal{K}$. She encodes these as $x^n(m,k)$ and inputs the quantum codeword
$\rho^{A'^n}_{x^n(m,k)}$ to the channel. Bob receives the state $\rho^{B^n}_{x^n(m,k)}$ and performs a POVM
$(\Lambda_{m,k})_{(m,k)\in\mathcal{M}\times\mathcal{K}}$ that determines the pair $m$ and $k$ correctly with
probability $1-\sqrt[4]{\varepsilon'}$. The code guarantees that Eve has almost no knowledge about the
message $m$. The private communication rate $P$ of the private code is equal to the following
expression:

$$P \equiv \frac{1}{n}\log_2|\mathcal{M}| = I(X;B) - I(X;E) - 6c\delta. \tag{22.70}$$

This concludes the proof of the direct coding theorem.

We remark that the above proof applies even in the scenario where Eve does not get
the full purification of the channel. That is, suppose that the channel has one input $A'$
for Alice and two outputs $B$ and $E$ for Bob and Eve, respectively. Then the channel has
an isometric extension to some environment $F$. In this scenario, the private information
$I(X;B) - I(X;E)$ is still achievable for some classical-quantum state input such that the
Holevo information difference is positive. However, one could always give both outputs $E$ and
$F$ to an eavesdropper (this is the setting that we proved in the above theorem). Giving the
full purification of the channel to the environment ensures that the transmitted information is
private from the "rest of the universe" (anyone other than the intended receiver), and it thus
yields the highest standard of security in any protocol for private information transmission.

22.4 The Converse Theorem



We now prove the converse part of the private classical capacity theorem, which demonstrates
that the regularization of the private information is an upper bound on the private classical
capacity. We suppose instead that Alice and Bob are trying to accomplish the task of secret
key generation. As we have argued in other converse proofs (see Sections 19.3.2 and 20.5),
the capacity for generating this static resource can only be larger than the capacity for
private classical communication because Alice and Bob can always use a noiseless private
channel to establish a shared secret key. In such a task, Alice first prepares a maximally
correlated state $\overline{\Phi}^{MM'}$ and encodes the $M'$ variable as a codeword of the form in (22.1). This
encoding leads to a state of the following form, after Alice transmits her systems $A'^n$ over
many independent uses of the channel:

$$\omega^{MB^nE^n} \equiv \frac{1}{|\mathcal{M}|}\sum_{m\in\mathcal{M}}|m\rangle\langle m|^M \otimes U_{\mathcal{N}}^{A'^n\to B^nE^n}(\rho_m^{A'^n}). \tag{22.71}$$

Bob finally applies a decoding map $\mathcal{D}^{B^n\to M'}$ to recover his share of the secret key:

$$\omega^{MM'E^n} \equiv \mathcal{D}^{B^n\to M'}(\omega^{MB^nE^n}). \tag{22.72}$$

If the code is good for secret key generation, then the following condition should hold

$$\left\|\omega^{MM'E^n} - \overline{\Phi}^{MM'}\otimes\sigma^{E^n}\right\|_1 \le \varepsilon, \tag{22.73}$$

so that Eve's state $\sigma^{E^n}$ is a constant state independent of the secret key $\overline{\Phi}^{MM'}$.
In particular, the above condition implies that Eve's information about $M$ is small:

$$I(M;E^n)_\omega \le n\varepsilon', \tag{22.74}$$

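where we apply the Alicki-Fannes' inequality to (22.73) with $\varepsilon' \equiv 6\varepsilon\log d_E + 4H_2(\varepsilon)/n$. The
rate $P-\delta$ of secret key generation is equal to $\frac{1}{n}\log|\mathcal{M}|$. Consider the following chain of
inequalities:

$$n(P-\delta) = I(M;M')_{\overline{\Phi}} \tag{22.75}$$
$$\le I(M;M')_\omega + n\varepsilon' \tag{22.76}$$
$$\le I(M;B^n)_\omega + n\varepsilon' \tag{22.77}$$
$$\le I(M;B^n)_\omega - I(M;E^n)_\omega + 2n\varepsilon' \tag{22.78}$$
$$\le P(\mathcal{N}^{\otimes n}) + 2n\varepsilon'. \tag{22.79}$$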


The first equality follows because the mutual information of the common randomness state
$\overline{\Phi}^{MM'}$ is equal to $n(P-\delta)$. The first inequality follows from applying the Alicki-Fannes'
inequality to (22.73) with the above choice of $\varepsilon'$. The second inequality is quantum data
processing. The third inequality follows from (22.74), and the final inequality follows because
the classical-quantum state in (22.71) has a particular distribution and choice of states, and
this choice always leads to a value of the private information that cannot be larger than the
private information of the tensor product channel $\mathcal{N}^{\otimes n}$.
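Dividing the chain (22.75)-(22.79) by $n$ makes the conclusion of the converse explicit. The following restatement is a sketch that uses only quantities already defined above:

```latex
P - \delta \;\le\; \frac{1}{n} P(\mathcal{N}^{\otimes n}) + 2\varepsilon',
\qquad
\varepsilon' = 6\varepsilon \log d_E + \frac{4 H_2(\varepsilon)}{n}.
```

Since $\varepsilon'$ vanishes as $\varepsilon \to 0$ and $\delta$ can be taken arbitrarily small, any achievable rate is bounded from above by the regularized private information.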
Exercise 22.4.1 Prove that free access to a forward public classical channel from Alice to 
Bob cannot improve the private classical capacity of a quantum channel. 

22.5 Discussion of Private Classical Capacity 

This last section discusses some important aspects of the private classical capacity. Two
of these results have to do with the fact that Theorem 22.2.1 only provides a regularized
characterization of the private classical capacity, and the last asks what rates of private
classical communication are achievable if the sender and receiver share a secret key before
communication begins. For full details, we refer the reader to the original papers in the
quantum Shannon theory literature.

22.5.1 Superadditivity of the Private Information 

Theorem 22.2.1 states that the private classical capacity of a quantum channel is equal to
the regularized private information of the channel. As we have said before (at the beginning
of Chapter 20), a regularized formula is not particularly useful from a practical perspective
because it is impossible to perform the optimization task that it sets out, and it is not
desirable from an information-theoretical perspective because such a regularization does not
identify a formula as a unique measure of correlations.

In light of the unsatisfactory nature of a regularized formula, is it really necessary to
have the regularization in Theorem 22.2.1 for arbitrary quantum channels? Interestingly,
the answer is "yes" in the general case (though we know it is not necessary if the channel
is degradable). The reason is that there exists an example of a channel $\mathcal{N}$ for which the
private information is strictly superadditive:

$$mP(\mathcal{N}) < P(\mathcal{N}^{\otimes m}), \tag{22.80}$$



for some positive integer $m$. Specifically, Smith et al. showed that the private information
of a particular Pauli channel exhibits this superadditivity [231]. To do so, they calculated
the private information $P(\mathcal{N})$ for such a channel. Next, they considered performing an
$m$-qubit "repetition code" before transmitting qubits into the channel. A repetition code is a
quantum code that performs the following encoding:

$$\alpha|0\rangle + \beta|1\rangle \to \alpha|0\rangle^{\otimes m} + \beta|1\rangle^{\otimes m}. \tag{22.81}$$

Evaluating the private information when sending a particular state through the repetition
code and then through $m$ instances of the channel leads to a higher value than $mP(\mathcal{N})$,
implying the strict inequality in (22.80). Thus, additivity of the private information
formula $P(\mathcal{N})$ cannot hold in the general case.
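The repetition encoding in (22.81) is easy to write down explicitly in the computational basis. The sketch below is only an illustration of the encoding map: the function name and the amplitudes are hypothetical, and it does not evaluate any private information.

```python
def repetition_encode(alpha, beta, m):
    """Amplitudes of alpha|0...0> + beta|1...1> on m qubits, in the computational basis."""
    state = [0.0] * (2 ** m)
    state[0] = alpha             # |00...0>
    state[2 ** m - 1] = beta     # |11...1>
    return state


# Hypothetical normalized amplitudes: 0.6^2 + 0.8^2 = 1.
encoded = repetition_encode(0.6, 0.8, m=3)
norm_squared = sum(abs(a) ** 2 for a in encoded)
print(encoded, norm_squared)
```

Only two of the $2^m$ amplitudes are nonzero, so the encoded state carries the same single logical qubit spread over $m$ physical qubits.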

This result implies that we do not yet understand the best way of
transmitting information privately over a quantum channel that is not degradable, and it is
thus the subject of ongoing research.

22.5.2 Superadditivity of Private Classical Capacity

The private information of a particular channel can be superadditive (as discussed in the
previous section), and so the regularized private information is our best characterization of
the capacity for this information processing task. In spite of this, we might hope that some
eventual formula for the private classical capacity would be additive (some formula other
than the private information $P(\mathcal{N})$). Interestingly, this is also not the case.

To clarify this point, suppose that $P^{?}(\mathcal{N})$ is some formula for the private classical capacity.
If it were an additive formula, then it should be additive as a function of channels:

$$P^{?}(\mathcal{N}\otimes\mathcal{M}) = P^{?}(\mathcal{N}) + P^{?}(\mathcal{M}). \tag{22.82}$$

Li et al. have shown that this cannot be the case for any proposed private capacity formula, by
making a clever argument with a construction of channels [183]. Specifically, they constructed
a particular channel $\mathcal{N}$ which has a single-letter classical capacity. The fact that the channel's
classical capacity is sharply upper bounded implies that its private classical capacity is
as well. Let $D$ be the upper bound so that $P^{?}(\mathcal{N}) \le D$. Also, they considered a 50%
erasure channel $\mathcal{M}$, one which gives the input state to Bob and an erasure symbol to Eve
with probability 1/2 and gives the input state to Eve and an erasure symbol to Bob with
probability 1/2. Such a channel has zero capacity for sending private classical information
essentially because Eve is getting the same amount of information as Bob does on average.
Thus, $P^{?}(\mathcal{M}) = 0$. In spite of this, Li et al. show that the tensor product channel
$\mathcal{N}\otimes\mathcal{M}$ has a private classical capacity that exceeds $D$. We can then conclude that these
two channels allow for superadditivity of private classical capacity:

$$P^{?}(\mathcal{N}\otimes\mathcal{M}) > P^{?}(\mathcal{N}) + P^{?}(\mathcal{M}), \tag{22.83}$$

and that (22.82) cannot hold in the general case. More profoundly, their results demonstrate
that the private classical capacity itself is non-additive, even if a characterization of it is
found that is more desirable than that with the formula in Theorem 22.2.1. Thus, it will
likely be difficult to obtain a desirable characterization of the private classical capacity for
general quantum channels.

22.5.3 Secret-key Assisted Private Classical Communication 



The direct coding part of Theorem 22.2.1 demonstrates how to send private classical information
over a quantum channel $\mathcal{N}$ at the private information rate $P(\mathcal{N})$. A natural extension
to consider is the scenario where Alice and Bob share a secret key before communication
begins. A secret key shared between Alice and Bob and secure from Eve is a tripartite state
of the following form:

$$\overline{\Phi}^{AB}\otimes\sigma^{E}, \tag{22.84}$$

where $\overline{\Phi}$ is the maximally correlated state and $\sigma^{E}$ is a state on Eve's system that is
independent of the key shared between Alice and Bob. Like the entanglement-assisted capacity
theorem, we assume that they obtain this secret key from some third party, and the third
party ensures that the key remains secure.

The resulting capacity theorem is known as the secret-key-assisted private classical capacity
theorem, and it characterizes the trade-off between secret key consumption and private
classical communication. The main idea for this setting is to show the existence of a protocol
that transmits private classical information at a rate of $I(X;B)$ private bits per channel use
while consuming secret key at a rate of $I(X;E)$ secret key bits per channel use, where the
information quantities are with respect to the state in Theorem 22.2.1. The protocol for
achieving these rates is almost identical to the one we gave in the proof of the direct coding
theorem, though with one difference. Instead of sacrificing classical bits at a rate of $I(X;E)$
in order to randomize Eve's knowledge of the message (recall that our randomization variable
had to be chosen uniformly at random from a set of size $\approx 2^{nI(X;E)}$), the sender exploits
the secret key to do so. The converse proof shows that this strategy is optimal (with a
multi-letter characterization). Thus, we have the following capacity theorem.

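Theorem 22.5.1 (Secret-key-assisted capacity theorem). The secret-key-assisted private
classical capacity region $C_{\mathrm{SKA}}(\mathcal{N})$ of a quantum channel $\mathcal{N}$ is given by

$$C_{\mathrm{SKA}}(\mathcal{N}) = \overline{\bigcup_{k=1}^{\infty}\frac{1}{k}C_{\mathrm{SKA}}^{(1)}(\mathcal{N}^{\otimes k})}, \tag{22.85}$$

where the overbar indicates the closure of a set. $C_{\mathrm{SKA}}^{(1)}(\mathcal{N})$ is the set of all $P, S \ge 0$ such that

$$P \le I(X;B)_\sigma - I(X;E)_\sigma + S, \tag{22.86}$$
$$P \le I(X;B)_\sigma, \tag{22.87}$$

where $P$ is the rate of private classical communication, $S$ is the rate of secret key consumption,
the state $\sigma^{XBE}$ is of the following form

$$\sigma^{XBE} \equiv \sum_x p_X(x)|x\rangle\langle x|^X \otimes U_{\mathcal{N}}^{A'\to BE}(\rho_x^{A'}), \tag{22.88}$$

and $U_{\mathcal{N}}^{A'\to BE}$ is an isometric extension of the channel.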

Showing that the above inequalities are achievable follows by time-sharing between the
protocol from the direct coding part of Theorem 22.2.1 and the aforementioned protocol for
secret-key-assisted private classical communication.
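The time-sharing argument can be illustrated for fixed values of the information quantities. The sketch below is hypothetical in its function names and numbers; it checks the two inequalities (22.86)-(22.87) and mixes the unassisted rate pair $(I(X;B)-I(X;E), 0)$ with the key-assisted rate pair $(I(X;B), I(X;E))$, assuming $I(X;B) \ge I(X;E)$.

```python
def in_single_letter_region(p, s, i_xb, i_xe):
    """Check the two inequalities (22.86)-(22.87) for fixed information quantities."""
    return p >= 0 and s >= 0 and p <= i_xb - i_xe + s and p <= i_xb


def time_share(lam, i_xb, i_xe):
    """Mix the unassisted corner (lam = 0) with the key-assisted corner (lam = 1)."""
    unassisted = (i_xb - i_xe, 0.0)   # private coding without a key
    assisted = (i_xb, i_xe)           # key-assisted protocol
    p = (1 - lam) * unassisted[0] + lam * assisted[0]
    s = (1 - lam) * unassisted[1] + lam * assisted[1]
    return p, s


# Hypothetical values of I(X;B) and I(X;E), chosen to be exact in binary.
I_XB, I_XE = 0.75, 0.25
point = time_share(0.5, I_XB, I_XE)
print(point, in_single_letter_region(*point, I_XB, I_XE))
```

Every time-shared point lies on the boundary $P = I(X;B) - I(X;E) + S$, and supplying more key than $I(X;E)$ does not help because of the second inequality.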

22.6 History and Further Reading 



Bennett and Brassard devised the first protocol for sending private classical data over a
quantum channel [22]. The protocol given there became known as quantum key distribution,
which has now become a thriving field in its own right [212]. Devetak [68] and Cai,
Winter, and Yeung [51] proved the characterization of the private classical capacity given in
this chapter (both using the techniques in this chapter). Hsieh et al. [157] proved achievability
of the secret-key-assisted protocol given in Section 22.5.3, and Ref. [248] proved the
converse and stated the secret-key-assisted capacity theorem. Later work characterized the
full trade-off between public classical communication, private classical communication, and
secret key [158], [252]. Smith et al. showed that the private information can exhibit
superadditivity [231], and Li et al. showed that the private classical capacity is generally
non-additive [183]. Smith later showed that the symmetric-side-channel-assisted private classical
capacity is additive [230]. Datta and Hsieh recently demonstrated universal private codes
for quantum channels [61].

Quantum Communication



The quantum capacity theorem is one of the most important theorems in quantum Shannon 
theory. It is a fundamentally "quantum" theorem in that it demonstrates that a funda- 
mentally quantum information quantity, the coherent information, is an achievable rate for 
quantum communication over a quantum channel. The fact that the coherent information 
does not have a strong analog in classical Shannon theory truly separates the quantum and 
classical theories of information. 



The no-cloning theorem (Section 3.5.4) provides the intuition behind the quantum ca- 
pacity theorem. The goal of any quantum communication protocol is for Alice to establish 
quantum correlations with the receiver Bob. We know well now that every quantum channel 
has an isometric extension, so that we can think of another receiver, the environment Eve, 
who is at a second output port of a larger unitary evolution. Were Eve able to learn anything 
about the quantum information that Alice is attempting to transmit to Bob, then Bob could 
not be retrieving this information — otherwise, they would violate the no-cloning theorem. 
Thus, Alice should figure out some subspace of the channel input where she can place her 
quantum information such that only Bob has access to it, while Eve does not. That the 
dimensionality of this subspace is exponential in the coherent information is perhaps then 
unsurprising in light of the above no-cloning reasoning. The coherent information is an
entropy difference $H(B) - H(E)$, a measure of the amount of quantum correlations that Alice
can establish with Bob less the amount that Eve can gain.
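The entropy difference $H(B) - H(E)$ can equivalently be written in terms of mutual informations. The short derivation below is a sketch that assumes the overall state on systems $A$, $B$, and $E$ is pure (Alice's input is purified by $A$ and the channel is replaced by its isometric extension), so that $H(AB) = H(E)$ and $H(AE) = H(B)$:

```latex
\begin{align}
I(A;B) - I(A;E)
  &= [H(A) + H(B) - H(AB)] - [H(A) + H(E) - H(AE)] \\
  &= [H(B) - H(E)] - [H(E) - H(B)] \\
  &= 2\,[H(B) - H(E)] = 2\, I(A\rangle B).
\end{align}
```

Read this way, the coherent information is half of what Bob's correlations with Alice exceed Eve's by, which matches the no-cloning intuition above.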

We proved achievability of the coherent information for quantum data transmission in
Corollary 21.2.1, but the roundabout path that we followed to prove achievability there
perhaps does not give much insight into the structure of a quantum code that achieves the 
coherent information. Our approach in this chapter is different and should shed more light 
on this structure. Specifically, we show how to make coherent versions of the private classical 
codes from the previous chapter. By exploiting the privacy properties of these codes, we can 
form subspaces where Alice can store her quantum information such that Eve does not have 
access to it. Thus, this approach follows the above "no-cloning intuition" more closely. 



Recall from Exercise 11.6.6 that we can also write the coherent information as half the difference of
Bob's mutual information with Alice less Eve's: $I(A\rangle B) = \frac{1}{2}[I(A;B) - I(A;E)]$.

The best characterization that we have for the quantum capacity of a general quantum
channel is the regularized coherent information. It turns out that the regularization is not
necessary for the class of degradable channels, implying that we have a complete understanding
of the quantum data transmission capabilities of these channels. However, if a channel
is not degradable, there can be some startling consequences, and these results imply that
we have an incomplete understanding of quantum data transmission in the general case.
First, the coherent information can be strictly superadditive for the depolarizing channel.
This means that the best strategy for achieving the quantum capacity is not necessarily the
familiar one where we generate random quantum codes from a single instance of a channel.
This result is also in marked contrast with the "classical" strategies that achieve the
unassisted and entanglement-assisted classical capacities of the depolarizing channel. Second,
perhaps the most surprising result in quantum Shannon theory is that it is possible to
"superactivate" the quantum capacity. That is, suppose that two channels on their own have
zero capacity for transmitting quantum information (for the phenomenon to occur, these
channels are specific channels). Then it is possible for the joint channel (the tensor product
of the individual channels) to have a non-zero quantum capacity, in spite of them being
individually useless for quantum data transmission. This latter result implies that we are
rather distant from having a complete quantum theory of information, in spite of the many
successes reviewed in this book.

We structure this chapter as follows. We first overview the information processing task
relevant for quantum communication. Next, we discuss the no-cloning intuition for quantum
capacity in some more detail, presenting the specific example of a quantum erasure
channel. Section 23.3 states the quantum capacity theorem, and the following two sections
prove the direct coding and converse theorems corresponding to it. Section 23.6 computes
the quantum capacity of two degradable channels: the quantum erasure channel and the
amplitude damping channel. We then discuss superadditivity of coherent information and
superactivation of quantum capacity in Section 23.7. Finally, we prove the existence of an
entanglement distillation protocol, whose proof bears some similarities to the proof of the
direct coding part of the quantum capacity theorem.



23.1 The Information Processing Task 

We begin the technical development in this chapter by describing the information processing
task for quantum communication (we define an $(n, Q-\delta, \varepsilon)$ quantum communication
code). First, there are several protocols that we can consider for quantum communication,
but perhaps the strongest definition of quantum capacity corresponds to a task known as
entanglement transmission. Suppose that Alice shares entanglement with a reference system
to which she does not have access. Then their goal is to devise a quantum coding scheme
such that Alice can transfer this entanglement to Bob. To this end, suppose that Alice and
the reference share an arbitrary state $|\varphi\rangle^{RA_1}$. Alice then performs some encoder on system
$A_1$ to prepare it for input to many instances of a quantum channel $\mathcal{N}^{A'\to B}$. The resulting
state is as follows:

$$\mathcal{E}^{A_1\to A'^n}(\varphi^{RA_1}). \tag{23.1}$$

Figure 23.1: The information processing task for entanglement transmission. Alice is trying to preserve the
entanglement with some inaccessible reference system by encoding her system and transmitting the encoded
quantum data over many independent uses of a noisy quantum channel. Bob performs a decoding of the
systems he receives, and the state at the end of the protocol is close to the original state shared between
Alice and the reference if the protocol is any good for entanglement transmission.

Alice transmits the systems A' n through many independent uses of the channel, resulting in 
the following state: 

M A ' n ^ Bn (£ A ^ A ' n (<p RAl )), (23.2) 

where J\f A n ^ Bn = (J\T A -^ s )® n . After Bob receives the systems B n from the channel outputs, 
he performs some decoding map V Bn ^ Bl , where E>i is some system of the same dimension 
as A\. The final state after Bob decodes is as follows: 



$$\omega^{RB_1} \equiv \mathcal{D}^{B^n \to B_1}(\mathcal{N}^{A'^n \to B^n}(\mathcal{E}^{A_1 \to A'^n}(\varphi^{RA_1}))). \tag{23.3}$$

Figure 23.1 depicts all of the above steps.

If the protocol is good for quantum communication, then the following condition should
hold for all states $|\varphi\rangle^{RA_1}$:

$$\left\| \varphi^{RA_1} - \omega^{RB_1} \right\|_1 \leq \epsilon. \tag{23.4}$$

The rate $Q$ of this scheme is equal to the number of qubits transmitted per channel use:

$$Q \equiv \frac{1}{n} \log d_{A_1} + \delta, \tag{23.5}$$

where $d_{A_1}$ is the dimension of the $A_1$ register and $\delta$ is an arbitrarily small positive number.
We say that a rate $Q$ is achievable if there exists an $(n, Q - \delta, \epsilon)$ quantum communication
code for all $\epsilon, \delta > 0$ and sufficiently large $n$.

The above notion of quantum communication encompasses other quantum information 
processing tasks such as mixed state transmission, pure state transmission, and entanglement 
generation. Alice can transmit any mixed or pure state if she can preserve the entanglement
with a reference system. Also, she can generate entanglement with Bob if she can preserve 
entanglement with a reference system — she just needs to create an entangled state locally 
and apply the above protocol to one system of the entangled state. 

23.2 The No-Cloning Theorem and Quantum Communication

We first discuss quantum communication over a quantum erasure channel before stating and
proving the quantum capacity theorem. Consider the quantum erasure channel that gives
Alice's input state to Bob with probability $1 - \epsilon$ and an erasure flag to Bob with probability $\epsilon$:

$$\rho \to (1 - \epsilon)\rho + \epsilon |e\rangle\langle e|, \tag{23.6}$$

where $\langle e|\rho|e\rangle = 0$ for all inputs $\rho$. Recall that the isometric extension of this channel is as
follows (see Exercise 5.2.6):

$$|\psi\rangle^{RA} \to \sqrt{1 - \epsilon}\,|\psi\rangle^{RB}|e\rangle^{E} + \sqrt{\epsilon}\,|\psi\rangle^{RE}|e\rangle^{B}, \tag{23.7}$$

so that the channel has the complementary interpretation that Eve receives the state with
probability $\epsilon$ and the erasure flag with probability $1 - \epsilon$.

Now suppose that the erasure parameter is set to 1/2. In such a scenario, the channel to
Eve is the same as the channel to Bob, namely, both have the channel $\rho \to \frac{1}{2}(\rho + |e\rangle\langle e|)$.
We can argue that the quantum capacity of such a channel should be zero, by invoking the
no-cloning theorem. More specifically, suppose there is a scheme (an encoder and decoder
as given in Figure 23.1) for Alice and Bob to communicate quantum information reliably at
a non-zero rate over such a channel. If so, Eve could simply use the same decoder that Bob
does, and she should also be able to obtain the quantum information that Alice is sending.
But the ability for both Bob and Eve to decode the quantum information that Alice is
transmitting violates the no-cloning theorem. Thus, the quantum capacity of such a channel
should vanish.

Exercise 23.2.1 Prove that the quantum capacity of an amplitude damping channel vanishes
if its damping parameter is equal to 1/2.

The no-cloning theorem plays a more general role in the analysis of quantum communication
over quantum channels. In the construction of a quantum code, we are trying to find
a "no-cloning" subspace of the input Hilbert space that is protected from Eve. If Eve is able
to obtain any of the quantum information in this subspace, then this information cannot be
going to Bob by the same no-cloning argument featured in the previous paragraph. Thus,
we might then suspect that the codes from the previous chapter for private classical
communication might play a role for quantum communication because we constructed them in
such a way that Eve would not be able to obtain any information about the private message
that Alice is transmitting to Bob. The main insight needed is to make a coherent version
of these private classical codes, so that Alice and Bob conduct every step in superposition
(much like we did in Chapter 21).



23.3 The Quantum Capacity Theorem 

The main theorem of this chapter is the following quantum capacity theorem. 

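Theorem 23.3.1 (Quantum Capacity). The quantum capacity of a quantum channel $\mathcal{N}^{A' \to B}$
is the supremum over all achievable rates for quantum communication, and one characterization
of it is the regularization of the coherent information of the channel:

$$\sup\{Q \mid Q \text{ is achievable}\} = Q_{\text{reg}}(\mathcal{N}), \tag{23.8}$$

where

$$Q_{\text{reg}}(\mathcal{N}) \equiv \lim_{k \to \infty} \frac{1}{k} Q(\mathcal{N}^{\otimes k}). \tag{23.9}$$

The channel coherent information $Q(\mathcal{N})$ is defined as

$$Q(\mathcal{N}) \equiv \max_{\phi} I(A\rangle B)_\sigma, \tag{23.10}$$

where the optimization is over all pure, bipartite states $\phi^{AA'}$ and

$$\sigma^{AB} \equiv \mathcal{N}^{A' \to B}(\phi^{AA'}). \tag{23.11}$$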

We prove this theorem in two parts: the direct coding theorem and the converse theorem. 
The proof of the direct coding theorem proceeds by exploiting the private classical codes 
from the previous chapter. The proof of the converse theorem is similar to approaches from 
previous chapters — we exploit the Alicki-Fannes' inequality and quantum data processing in 
order to obtain an upper bound on the quantum capacity. In general, the regularized coherent 
information is our best characterization of the quantum capacity, but the regularization is not 
necessary for the class of degradable channels. Since many channels of interest are degradable 
(including dephasing, amplitude damping, and erasure channels), we can calculate their 
quantum capacities. 

23.4 The Direct Coding Theorem 

The proof of the direct coding part of the quantum capacity theorem follows by taking
advantage of the properties of the private classical codes constructed in the previous chapter
(see Section 22.3). We briefly recall this construction. Suppose that a classical-quantum-quantum
channel connects Alice to Bob and Eve. Specifically, if Alice inputs a classical
letter $x$ to the channel, then Bob receives a density operator $\rho_x^B$ and Eve receives a density
operator $\omega_x^E$. The direct coding part of Theorem 22.2.1 establishes the existence of a
codebook $\{x^n(m,k)\}_{m \in \mathcal{M}, k \in \mathcal{K}}$ selected from a distribution $p_X(x)$ and a corresponding decoding
POVM $\{\Lambda_{m,k}^{B^n}\}$ such that Bob can detect Alice's message $m$ and randomizing variable $k$ with
high probability:

$$\forall m \in \mathcal{M}, k \in \mathcal{K}: \quad \operatorname{Tr}\{\Lambda_{m,k}^{B^n} \rho_{x^n(m,k)}^{B^n}\} \geq 1 - \epsilon, \tag{23.12}$$

while Eve obtains asymptotically zero information about Alice's message $m$:

$$\forall m \in \mathcal{M}: \quad \left\| \frac{1}{|\mathcal{K}|} \sum_{k \in \mathcal{K}} \omega_{x^n(m,k)}^{E^n} - \omega^{\otimes n} \right\|_1 \leq \epsilon, \tag{23.13}$$

where $\omega$ is Eve's expected density operator:

$$\omega \equiv \sum_x p_X(x)\, \omega_x^E. \tag{23.14}$$

The above statements hold true for all $\epsilon > 0$ and sufficiently large $n$ as long as

$$|\mathcal{M}| \approx 2^{n[I(X;B) - I(X;E)]}, \tag{23.15}$$

$$|\mathcal{K}| \approx 2^{n I(X;E)}. \tag{23.16}$$

We can now construct a coherent version of the above code that is good for quantum data
transmission. First, suppose that there is some density operator with the following spectral
decomposition:

$$\rho^{A'} = \sum_x p_X(x) |\psi_x\rangle\langle\psi_x|^{A'}. \tag{23.17}$$

Now suppose that the channel $\mathcal{N}^{A' \to B}$ has an isometric extension $U_{\mathcal{N}}^{A' \to BE}$, so that inputting
$|\psi_x\rangle^{A'}$ leads to the following state shared between Bob and Eve:

$$|\psi_x\rangle^{BE} \equiv U_{\mathcal{N}}^{A' \to BE} |\psi_x\rangle^{A'}. \tag{23.18}$$

From the direct coding part of Theorem 22.2.1, we know that there exists a private classical
code $\{x^n(m,k)\}_{m \in \mathcal{M}, k \in \mathcal{K}}$ with the properties in (23.12)-(23.13) and with rate

$$I(X;B)_\sigma - I(X;E)_\sigma, \tag{23.19}$$

where $\sigma^{XBE}$ is a classical-quantum state of the following form:

$$\sigma^{XBE} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes |\psi_x\rangle\langle\psi_x|^{BE}. \tag{23.20}$$

The following identity demonstrates that the private information in (23.19) is equal to the
coherent information for the particular state $\sigma^{XBE}$ above:

\begin{align}
I(X;B)_\sigma - I(X;E)_\sigma &= H(B)_\sigma - H(B|X)_\sigma - H(E)_\sigma + H(E|X)_\sigma \tag{23.21} \\
&= H(B)_\sigma - H(B|X)_\sigma - H(E)_\sigma + H(B|X)_\sigma \tag{23.22} \\
&= H(B)_\sigma - H(E)_\sigma. \tag{23.23}
\end{align}

The first equality follows from the identity $I(C;D) = H(D) - H(D|C)$. The second equality
follows because the state on systems $BE$ is pure when conditioned on the classical variable $X$,
so that $H(E|X)_\sigma = H(B|X)_\sigma$.
Observe that the last expression is a function solely of the density operator $\rho$ in (23.17), and
it is also equal to the coherent information of the channel for the particular input state $\rho$
(see Exercise 11.5.2).

Now we show how to construct a quantum code achieving the coherent information rate
$H(B)_\sigma - H(E)_\sigma$ by making a coherent version of the above private classical code. Suppose
that Alice shares a state $|\varphi\rangle^{RA_1}$ with a reference system $R$, where

$$|\varphi\rangle^{RA_1} \equiv \sum_{l,m \in \mathcal{M}} \alpha_{l,m} |l\rangle^{R} |m\rangle^{A_1}, \tag{23.24}$$

and $\{|l\rangle^{R}\}$ is some orthonormal basis for $R$ while $\{|m\rangle^{A_1}\}$ is some orthonormal basis for $A_1$.
Also, we set $|\mathcal{M}| \approx 2^{n[H(B)_\sigma - H(E)_\sigma]}$. We would like for Alice and Bob to execute a quantum
communication protocol such that Bob can reconstruct Alice's share of the above state on
his system with Alice no longer entangled with the reference (we would like for the final state
to be approximately close to $|\varphi\rangle^{RA_1}$, where Bob is holding the $A_1$ system). To this end, Alice
creates a quantum codebook $\{|\phi_m\rangle^{A'^n}\}_{m \in \mathcal{M}}$ with quantum codewords:

$$|\phi_m\rangle^{A'^n} = \frac{1}{\sqrt{|\mathcal{K}|}} \sum_{k \in \mathcal{K}} e^{i\gamma_{m,k}} |\psi_{x^n(m,k)}\rangle^{A'^n}, \tag{23.25}$$

where the states $|\psi_{x^n(m,k)}\rangle^{A'^n}$ are the $n$th extensions of the states arising from the spectral
decomposition in (23.17), the classical sequences $x^n(m,k)$ are from the codebook for private
classical communication, and we specify how to choose the phases $\gamma_{m,k}$ later. All the states
$|\psi_{x^n(m,k)}\rangle^{A'^n}$ are orthonormal because they are picked from the spectral decomposition in
(23.17) and the expurgation from Section 22.3.4 guarantees that they are distinct (otherwise,
they would not be good codewords!). The fact that the states $|\psi_{x^n(m,k)}\rangle^{A'^n}$ are orthonormal
implies that the quantum codewords $|\phi_m\rangle^{A'^n}$ are also orthonormal.

Alice's first action is to coherently copy the value of $m$ in the $A_1$ register to another
register $A_2$, so that the state in (23.24) becomes

$$\sum_{l,m} \alpha_{l,m} |l\rangle^{R} |m\rangle^{A_1} |m\rangle^{A_2}. \tag{23.26}$$

Alice then performs some isometric encoding from $A_2$ to $A'^n$ that takes the above unencoded
state to the following encoded state:

$$\sum_{l,m} \alpha_{l,m} |l\rangle^{R} |m\rangle^{A_1} |\phi_m\rangle^{A'^n}, \tag{23.27}$$

where each $|\phi_m\rangle^{A'^n}$ is a quantum codeword of the form in (23.25). Alice transmits the systems
$A'^n$ through many uses of the quantum channel, leading to the following state shared between
the reference, Alice, Bob, and Eve:

$$\sum_{l,m} \alpha_{l,m} |l\rangle^{R} |m\rangle^{A_1} |\phi_m\rangle^{B^n E^n}, \tag{23.28}$$

where $|\phi_m\rangle^{B^n E^n}$ is defined from (23.18) and (23.25). Recall from (23.12) that Bob can detect
the message $m$ and the variable $k$ in the private classical code with high probability:

$$\forall m, k: \quad \operatorname{Tr}\{\Lambda_{m,k}^{B^n} \rho_{x^n(m,k)}^{B^n}\} \geq 1 - \epsilon. \tag{23.29}$$

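So Bob instead constructs a coherent version of this POVM:

$$\sum_{m \in \mathcal{M}, k \in \mathcal{K}} \sqrt{\Lambda_{m,k}}^{B^n} \otimes |m\rangle^{B_1} |k\rangle^{B_2}. \tag{23.30}$$

He then performs this coherent POVM, resulting in the state

$$\sum_{l,m} \sum_{k \in \mathcal{K}} \sum_{m' \in \mathcal{M}, k' \in \mathcal{K}} \frac{1}{\sqrt{|\mathcal{K}|}} \alpha_{l,m} e^{i\gamma_{m,k}} |l\rangle^{R} |m\rangle^{A_1} \sqrt{\Lambda_{m',k'}}\, |\psi_{x^n(m,k)}\rangle^{B^n E^n} |m', k'\rangle^{B_1 B_2}. \tag{23.31}$$

We would like for the above state to be close in trace distance to the following state:

$$\sum_{l,m} \sum_{k \in \mathcal{K}} \frac{1}{\sqrt{|\mathcal{K}|}} \alpha_{l,m} e^{i\delta_{m,k}} |l\rangle^{R} |m\rangle^{A_1} |\psi_{x^n(m,k)}\rangle^{B^n E^n} |m\rangle^{B_1} |k\rangle^{B_2}, \tag{23.32}$$

where $\delta_{m,k}$ are some phases that we will specify shortly. To this end, consider that the sets
$\{|\chi_{m,k}\rangle^{B^n E^n B_1 B_2}\}_{m,k}$ and $\{|\varphi_{m,k}\rangle^{B^n E^n B_1 B_2}\}_{m,k}$ form orthonormal bases, where

$$|\chi_{m,k}\rangle^{B^n E^n B_1 B_2} \equiv |\psi_{x^n(m,k)}\rangle^{B^n E^n} |m\rangle^{B_1} |k\rangle^{B_2}, \tag{23.33}$$

$$|\varphi_{m,k}\rangle^{B^n E^n B_1 B_2} \equiv \sum_{m' \in \mathcal{M}, k' \in \mathcal{K}} \sqrt{\Lambda_{m',k'}}^{B^n} |\psi_{x^n(m,k)}\rangle^{B^n E^n} |m'\rangle^{B_1} |k'\rangle^{B_2}. \tag{23.34}$$

Also, consider that the overlap between corresponding states in the different bases is high:

\begin{align}
\langle \chi_{m,k} | \varphi_{m,k} \rangle
&= \langle \psi_{x^n(m,k)}|^{B^n E^n} \langle m|^{B_1} \langle k|^{B_2} \sum_{m' \in \mathcal{M}, k' \in \mathcal{K}} \sqrt{\Lambda_{m',k'}}^{B^n} |\psi_{x^n(m,k)}\rangle^{B^n E^n} |m'\rangle^{B_1} |k'\rangle^{B_2} \tag{23.35} \\
&= \sum_{m' \in \mathcal{M}, k' \in \mathcal{K}} \langle \psi_{x^n(m,k)}| \sqrt{\Lambda_{m',k'}}\, |\psi_{x^n(m,k)}\rangle \langle m | m' \rangle \langle k | k' \rangle \tag{23.36} \\
&= \langle \psi_{x^n(m,k)}| \sqrt{\Lambda_{m,k}}\, |\psi_{x^n(m,k)}\rangle \tag{23.37} \\
&\geq \langle \psi_{x^n(m,k)}|^{B^n E^n} \Lambda_{m,k}^{B^n} |\psi_{x^n(m,k)}\rangle^{B^n E^n} \tag{23.38} \\
&= \operatorname{Tr}\{\Lambda_{m,k}^{B^n} \rho_{x^n(m,k)}^{B^n}\} \tag{23.39} \\
&\geq 1 - \epsilon, \tag{23.40}
\end{align}

where the first inequality follows from the fact that $\sqrt{\Lambda_{m,k}} \geq \Lambda_{m,k}$ for $\Lambda_{m,k} \leq I$ and the
second inequality follows from (23.29). By applying Lemma A.0.4 from Appendix A, we
know that there exist phases $\gamma_{m,k}$ and $\delta_{m,k}$ such that

$$\langle \chi_m | \varphi_m \rangle \geq 1 - \epsilon, \tag{23.41}$$

where

$$|\chi_m\rangle^{B^n E^n B_1 B_2} \equiv \frac{1}{\sqrt{|\mathcal{K}|}} \sum_{k} e^{i\delta_{m,k}} |\chi_{m,k}\rangle^{B^n E^n B_1 B_2}, \tag{23.42}$$

$$|\varphi_m\rangle^{B^n E^n B_1 B_2} \equiv \frac{1}{\sqrt{|\mathcal{K}|}} \sum_{k} e^{i\gamma_{m,k}} |\varphi_{m,k}\rangle^{B^n E^n B_1 B_2}. \tag{23.43}$$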

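So we choose the phases in a way such that the above inequality holds. We can then apply
the above result to show that the state in (23.31) has high fidelity with the state in (23.32):

\begin{align}
&\left( \sum_{l,m} \alpha_{l,m}^{*} \langle l|^{R} \langle m|^{A_1} \langle \chi_m|^{B^n E^n B_1 B_2} \right) \left( \sum_{l',m'} \alpha_{l',m'} |l'\rangle^{R} |m'\rangle^{A_1} |\varphi_{m'}\rangle^{B^n E^n B_1 B_2} \right) \notag \\
&\quad = \sum_{l,m,l',m'} \alpha_{l,m}^{*} \alpha_{l',m'} \langle l | l' \rangle^{R} \langle m | m' \rangle^{A_1} \langle \chi_m | \varphi_{m'} \rangle^{B^n E^n B_1 B_2} \tag{23.44} \\
&\quad = \sum_{l,m} |\alpha_{l,m}|^2 \langle \chi_m | \varphi_m \rangle^{B^n E^n B_1 B_2} \tag{23.45} \\
&\quad \geq 1 - \epsilon. \tag{23.46}
\end{align}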



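Thus, the state resulting after Bob performs the coherent POVM is close in trace distance
to the following state:

$$\sum_{l,m \in \mathcal{M}} \alpha_{l,m} |l\rangle^{R} |m\rangle^{A_1} |\chi_m\rangle^{B^n E^n B_1 B_2} = \sum_{l,m \in \mathcal{M}} \alpha_{l,m} |l\rangle^{R} |m\rangle^{A_1} |\phi_m\rangle^{B^n E^n B_2} |m\rangle^{B_1}, \tag{23.47}$$

where

$$|\phi_m\rangle^{B^n E^n B_2} \equiv \frac{1}{\sqrt{|\mathcal{K}|}} \sum_{k \in \mathcal{K}} e^{i\delta_{m,k}} |\psi_{x^n(m,k)}\rangle^{B^n E^n} |k\rangle^{B_2}. \tag{23.48}$$

Consider the state of Eve for a particular value of $m$:

\begin{align}
\operatorname{Tr}_{B^n B_2}\{|\phi_m\rangle\langle\phi_m|^{B^n E^n B_2}\}
&= \operatorname{Tr}_{B^n B_2}\left\{ \frac{1}{|\mathcal{K}|} \sum_{k,k' \in \mathcal{K}} e^{i(\delta_{m,k} - \delta_{m,k'})} |\psi_{x^n(m,k)}\rangle\langle\psi_{x^n(m,k')}|^{B^n E^n} \otimes |k\rangle\langle k'|^{B_2} \right\} \tag{23.49} \\
&= \frac{1}{|\mathcal{K}|} \sum_{k \in \mathcal{K}} \operatorname{Tr}_{B^n}\{|\psi_{x^n(m,k)}\rangle\langle\psi_{x^n(m,k)}|^{B^n E^n}\} \tag{23.50} \\
&= \frac{1}{|\mathcal{K}|} \sum_{k \in \mathcal{K}} \omega_{x^n(m,k)}^{E^n}. \tag{23.51}
\end{align}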



We are now in a position to apply the second property of the private classical code. Recall
from the privacy condition in (23.13) that Eve's state is guaranteed to be $\epsilon$-close in trace
distance to the tensor power state $[(\mathcal{N}^c)^{A' \to E}(\rho)]^{\otimes n}$, where $(\mathcal{N}^c)^{A' \to E}$ is the complementary
channel and $\rho$ is the density operator in (23.17). Let $|\theta_{\mathcal{N}^c(\rho)}\rangle^{E^n B_3}$ be some purification of
this tensor power state. By Uhlmann's theorem and the relation between trace distance and
fidelity (see Definition 9.2.3 and Theorem 9.3.1), there is some isometry $U_m^{B^n B_2 \to B_3}$ for each
value of $m$ such that the following states are $2\sqrt{\epsilon}$-close in trace distance (see Exercise 9.2.7):

$$U_m^{B^n B_2 \to B_3} |\phi_m\rangle^{B^n E^n B_2} \approx_{2\sqrt{\epsilon}} |\theta_{\mathcal{N}^c(\rho)}\rangle^{E^n B_3}. \tag{23.52}$$



Bob's next step is to perform the following controlled isometry on his systems $B^n$, $B_1$, and $B_2$:

$$\sum_m |m\rangle\langle m|^{B_1} \otimes U_m^{B^n B_2 \to B_3}, \tag{23.53}$$

leading to a state that is close in trace distance to the following state:

$$\left( \sum_{l,m} \alpha_{l,m} |l\rangle^{R} |m\rangle^{A_1} |m\rangle^{B_1} \right) \otimes |\theta_{\mathcal{N}^c(\rho)}\rangle^{E^n B_3}. \tag{23.54}$$

At this point, the key observation is that the state on $E^n B_3$ is effectively decoupled from the
state on systems $R$, $A_1$, and $B_1$, so that Bob can just throw away his system $B_3$. Thus, they
have successfully implemented an approximate coherent channel from system $A_1$ to $A_1 B_1$.

We now allow for Alice to communicate classical information to Bob in order for them to
implement a quantum communication channel rather than just a mere coherent channel (in
a moment we argue that this free forward classical communication is not necessary). Alice
performs a Fourier transform on the register $A_1$, leading to the following state:

$$\frac{1}{\sqrt{d_{A_1}}} \sum_{l,m,j \in \mathcal{M}} \alpha_{l,m} \exp\{2\pi i m j / d_{A_1}\} |l\rangle^{R} |j\rangle^{A_1} |m\rangle^{B_1}. \tag{23.55}$$

She then measures register $A_1$ in the computational basis, leading to some outcome $j$ and
the following post-measurement state:

$$\left( \sum_{l,m \in \mathcal{M}} \alpha_{l,m} \exp\{2\pi i m j / d_{A_1}\} |l\rangle^{R} |m\rangle^{B_1} \right) \otimes |j\rangle^{A_1}. \tag{23.56}$$

She sends Bob the outcome $j$ of her measurement over a classical channel, and the protocol
ends with Bob performing the following unitary

$$Z^{\dagger}(j) |m\rangle^{B_1} = \exp\{-2\pi i m j / d_{A_1}\} |m\rangle^{B_1}, \tag{23.57}$$

leaving the desired state on the reference and Bob's system $B_1$:

$$\sum_{l,m} \alpha_{l,m} |l\rangle^{R} |m\rangle^{B_1}. \tag{23.58}$$
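The Fourier transform, measurement, and phase correction in (23.55)-(23.58) can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: it starts from the ideal coherent-channel state $\sum_{l,m} \alpha_{l,m} |l\rangle^{R} |m\rangle^{A_1} |m\rangle^{B_1}$ (ignoring the approximation errors of the actual protocol), and the function name `fourier_measure_correct` and the sample amplitudes are hypothetical.

```python
import cmath
import math


def fourier_measure_correct(alpha, outcome):
    """Follow (23.55)-(23.58) for the ideal coherent-channel state.

    alpha[l][m] holds the amplitudes of sum_{l,m} alpha[l][m] |l>^R |m>^{A1} |m>^{B1}.
    Returns the probability of Alice's measurement outcome and the normalized
    amplitudes left on R and B1 after Bob's phase correction.
    """
    d = len(alpha[0])  # dimension of A1 (and of B1)
    # Component of the Fourier-transformed state along |outcome>^{A1}, as in (23.55).
    branch = [
        [a * cmath.exp(2j * math.pi * m * outcome / d) / math.sqrt(d)
         for m, a in enumerate(row)]
        for row in alpha
    ]
    prob = sum(abs(a) ** 2 for row in branch for a in row)
    # Bob applies Z^dagger(outcome), multiplying |m> by exp(-2 pi i m outcome / d).
    corrected = [
        [a * cmath.exp(-2j * math.pi * m * outcome / d) / math.sqrt(prob)
         for m, a in enumerate(row)]
        for row in branch
    ]
    return prob, corrected


# Hypothetical normalized amplitudes with dim(R) = 2 and dim(A1) = 3.
alpha = [[0.5, 0.5j, 0.0], [0.0, 0.5, -0.5]]
for outcome in range(3):
    prob, state = fourier_measure_correct(alpha, outcome)
    print(outcome, round(prob, 6), state)
```

In this idealized setting every outcome $j$ occurs with the same probability $1/d_{A_1}$ and the corrected amplitudes agree with $\alpha_{l,m}$ up to floating-point error, which is the content of (23.58).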

All of the errors accumulated in the above protocol are some finite sum of $\epsilon$ terms, and
applying the triangle inequality several times implies that the actual state is close to the
desired state in the asymptotic limit of large block length. Figure 23.2 depicts all of the
steps in this protocol for quantum communication.

Figure 23.2: All of the steps in the protocol for quantum communication. Alice and Bob's goal is to
communicate as much quantum information as they can while making sure that Eve's state is independent of
what Alice is trying to communicate to Bob. The figure depicts the series of controlled unitaries that Alice and
Bob perform and the final measurement and classical communication that enables quantum communication
from Alice to Bob at the coherent information rate.

We now argue that the classical communication is not necessary: there exists a scheme
that does not require the use of this forward classical channel. After reviewing the above
protocol and glancing at Figure 23.2, we realize that Alice's encoder is a quantum instrument
of the following form:

$$\mathcal{E}(\rho) \equiv \sum_j \mathcal{E}_j(\rho) \otimes |j\rangle\langle j|. \tag{23.59}$$

Each map $\mathcal{E}_j$ is a trace-reducing map of the following form:

$$\mathcal{E}_j(\varphi^{RA_1}) = \langle j|^{A_1} F^{A_1} \left( \sum_{m'} |\phi_{m'}\rangle^{A'^n} \langle m'|^{A_2} \right) \left( \sum_{m} |m\rangle\langle m|^{A_1} \otimes |m\rangle^{A_2} \right) |\varphi\rangle^{RA_1}, \tag{23.60}$$

where $\sum_m |m\rangle\langle m|^{A_1} \otimes |m\rangle^{A_2}$ is Alice's coherent copier in (23.26), $\sum_{m'} |\phi_{m'}\rangle^{A'^n} \langle m'|^{A_2}$ is her
quantum encoder in (23.27), $F^{A_1}$ is the Fourier transform, and $\langle j|^{A_1}$ represents the projection
onto a particular measurement outcome $j$. We can simplify the above map as follows:



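\begin{align}
\mathcal{E}_j(\varphi^{RA_1}) &= \langle j|^{A_1} F^{A_1} \left( \sum_m |m\rangle\langle m|^{A_1} \otimes |\phi_m\rangle^{A'^n} \right) |\varphi\rangle^{RA_1} \tag{23.61} \\
&= \langle j|^{A_1} \left( \frac{1}{\sqrt{|\mathcal{M}|}} \sum_{m,j'} e^{2\pi i m j' / |\mathcal{M}|} |j'\rangle\langle m|^{A_1} \otimes |\phi_m\rangle^{A'^n} \right) |\varphi\rangle^{RA_1} \tag{23.62} \\
&= \frac{1}{\sqrt{|\mathcal{M}|}} \sum_m e^{2\pi i m j / |\mathcal{M}|} |\phi_m\rangle^{A'^n} \langle m|^{A_1} |\varphi\rangle^{RA_1}. \tag{23.63}
\end{align}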
It follows that the trace of each $\mathcal{E}_j$ is uniform and independent of the input state $\varphi^{RA_1}$:

$$\operatorname{Tr}\{\mathcal{E}_j(\varphi^{RA_1})\} = \frac{1}{|\mathcal{M}|}. \tag{23.64}$$

Observe that multiplying the map in (23.63) by $\sqrt{|\mathcal{M}|}$ gives a proper isometry that could
suffice as an encoding. Let $\mathcal{E}'_j$ denote the rescaled isometry. Corresponding to each encoder
is a decoding map $\mathcal{D}_j$ consisting of Bob's coherent measurement in (23.30), his decoupler in
(23.53), and his phase shifter in (23.57). We can thus represent the state output from our
classically-coordinated protocol as follows:

$$\sum_j \mathcal{D}_j(\mathcal{N}^{A'^n \to B^n}(\mathcal{E}_j(\varphi^{RA_1}))). \tag{23.65}$$

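From the analysis in the preceding paragraphs, we know that the trace distance between the
ideal state and the actual state is small for the classically-coordinated scheme:

$$\left\| \sum_j \mathcal{D}_j(\mathcal{N}^{A'^n \to B^n}(\mathcal{E}_j(\varphi^{RA_1}))) - \varphi^{RB_1} \right\|_1 \leq \epsilon', \tag{23.66}$$

where $\epsilon'$ is some arbitrarily small positive number. Thus, the fidelity between these two
states is high:

$$F\left( \sum_j \mathcal{D}_j(\mathcal{N}^{A'^n \to B^n}(\mathcal{E}_j(\varphi^{RA_1}))),\ \varphi^{RB_1} \right) \geq 1 - \epsilon'. \tag{23.67}$$

But we can rewrite the fidelity as follows:

\begin{align}
F\left( \sum_j \mathcal{D}_j(\mathcal{N}^{A'^n \to B^n}(\mathcal{E}_j(\varphi^{RA_1}))),\ \varphi^{RB_1} \right)
&= \langle \varphi|^{RB_1} \sum_j \mathcal{D}_j(\mathcal{N}^{A'^n \to B^n}(\mathcal{E}_j(\varphi^{RA_1}))) |\varphi\rangle^{RB_1} \tag{23.68} \\
&= \sum_j \langle \varphi|^{RB_1} \mathcal{D}_j(\mathcal{N}^{A'^n \to B^n}(\mathcal{E}_j(\varphi^{RA_1}))) |\varphi\rangle^{RB_1} \tag{23.69} \\
&= \sum_j \frac{1}{|\mathcal{M}|} \langle \varphi|^{RB_1} \mathcal{D}_j(\mathcal{N}^{A'^n \to B^n}(\mathcal{E}'_j(\varphi^{RA_1}))) |\varphi\rangle^{RB_1} \tag{23.70} \\
&\geq 1 - \epsilon', \tag{23.71}
\end{align}

implying that at least one of the encoder-decoder pairs $(\mathcal{E}'_j, \mathcal{D}_j)$ has asymptotically high
fidelity, because the average over a uniform distribution cannot exceed its largest term.
Thus, Alice and Bob simply agree beforehand to use a scheme $(\mathcal{E}'_j, \mathcal{D}_j)$ with high
fidelity, obviating the need for the forward classical communication channel.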

The protocol given here achieves communication at the coherent information rate. In
order to achieve the regularized coherent information rate in the statement of the theorem,
Alice and Bob apply the same protocol to the superchannel $(\mathcal{N}^{A' \to B})^{\otimes k}$ instead of the channel
$\mathcal{N}^{A' \to B}$.

23.5 Converse Theorem

This section proves the converse part of the quantum capacity theorem, demonstrating that 
the regularized coherent information is an upper bound on the quantum capacity of any 
quantum channel. For the class of degradable channels, the coherent information itself is an 
upper bound on the quantum capacity — this demonstrates that we completely understand 
the quantum data transmission capabilities of these channels. 

For this converse proof, we assume that Alice is trying to generate entanglement with
Bob. The capacity for this task is an upper bound on the capacity for quantum data
transmission because we can always use a noiseless quantum channel to establish entanglement.
We also allow Alice free forward classical communication to Bob, and we demonstrate that
this resource cannot increase the quantum capacity (essentially because the coherent
information is convex). In a protocol for entanglement generation, Alice begins by preparing the
maximally entangled state $\Phi^{AA_1}$ of Schmidt rank $2^{nQ}$ in her local laboratory, where $Q$ is the
rate of this entangled state. She performs some encoding operation $\mathcal{E}^{A_1 \to A'^n M}$ that outputs
many systems $A'^n$ and a classical register $M$. She then inputs the systems $A'^n$ to many
independent uses of a noisy quantum channel $\mathcal{N}^{A' \to B}$, resulting in the state

$$\omega^{AMB^n} \equiv \mathcal{N}^{A'^n \to B^n}(\mathcal{E}^{A_1 \to A'^n M}(\Phi^{AA_1})), \tag{23.72}$$

where $\mathcal{N}^{A'^n \to B^n} \equiv (\mathcal{N}^{A' \to B})^{\otimes n}$. Bob takes the outputs $B^n$ of the channels and the classical
register $M$ and performs some decoding operation $\mathcal{D}^{B^n M \to B_1}$, resulting in the state

$$(\omega')^{AB_1} \equiv \mathcal{D}^{B^n M \to B_1}(\omega^{AMB^n}). \tag{23.73}$$

If the protocol is any good for entanglement generation, then the following condition should
hold

$$\left\| (\omega')^{AB_1} - \Phi^{AB_1} \right\|_1 \leq \epsilon, \tag{23.74}$$

where $\epsilon$ is some arbitrarily small positive number.

The converse proof then proceeds in the following steps:

\begin{align}
nQ &= I(A\rangle B_1)_{\Phi} \tag{23.75} \\
&\leq I(A\rangle B_1)_{\omega'} + n\epsilon' \tag{23.76} \\
&\leq I(A\rangle B^n M)_{\omega} + n\epsilon'. \tag{23.77}
\end{align}

The first equality follows because the coherent information of a maximally entangled state
is equal to the logarithm of the dimension of one of its systems. The first inequality follows
from an application of the Alicki-Fannes' inequality to the condition in (23.74), with
$\epsilon' \equiv 4\epsilon Q + 2H_2(\epsilon)/n$. The second inequality follows from quantum data processing. Now consider
that the state $\omega^{AMB^n}$ is a classical-quantum state of the following form:

$$\omega^{AMB^n} \equiv \sum_m p_M(m) |m\rangle\langle m|^{M} \otimes \mathcal{N}^{A'^n \to B^n}(\rho_m^{AA'^n}). \tag{23.78}$$
We can then perform a spectral decomposition of each state $\rho_m$ as follows:

$$\rho_m^{AA'^n} = \sum_l p_{L|M}(l|m) |\phi_{l,m}\rangle\langle\phi_{l,m}|^{AA'^n}, \tag{23.79}$$

and augment the above state as follows:

$$\omega^{AMLB^n} \equiv \sum_{m,l} p_M(m) p_{L|M}(l|m) |m\rangle\langle m|^{M} \otimes |l\rangle\langle l|^{L} \otimes \mathcal{N}^{A'^n \to B^n}(\phi_{l,m}^{AA'^n}), \tag{23.80}$$

so that $\omega^{AMB^n} = \operatorname{Tr}_L\{\omega^{AMLB^n}\}$. We continue with bounding the rate $Q$:

\begin{align}
I(A\rangle B^n M)_{\omega} + n\epsilon' &\leq I(A\rangle B^n M L)_{\omega} + n\epsilon' \tag{23.81} \\
&= \sum_{m,l} p_M(m) p_{L|M}(l|m)\, I(A\rangle B^n)_{\mathcal{N}^{A'^n \to B^n}(\phi_{l,m}^{AA'^n})} + n\epsilon' \tag{23.82} \\
&\leq I(A\rangle B^n)_{\mathcal{N}^{A'^n \to B^n}((\phi_{l,m}^{*})^{AA'^n})} + n\epsilon' \tag{23.83} \\
&\leq Q(\mathcal{N}^{\otimes n}) + n\epsilon'. \tag{23.84}
\end{align}

The first inequality follows from the quantum data processing inequality. The first equality
follows because the registers $M$ and $L$ are both classical, and we can apply the result of
Exercise 11.5.5. The second inequality follows because the expectation is always less than
the maximal value (where we define $\phi_{l,m}^{*}$ to be the state that achieves this maximum). The final
inequality follows from the definition of the channel coherent information as the maximum
of the coherent information over all pure, bipartite inputs. This concludes the proof of the
converse part of the quantum capacity theorem.

There are a few comments we should make regarding the converse theorem. First, we 
see that classical communication cannot improve quantum capacity because the coherent 
information is convex. We could obtain the same upper bound on quantum capacity even if 
there were no classical communication. Second, it is sufficient to consider isometric encoders 
for quantum communication — that is, it is not necessary to exploit general noisy CPTP maps 
at the encoder. This makes sense intuitively because it would seem odd if noisy encodings 



could help in the noiseless transmission of quantum data. Our augmented state in (23.80) and
the subsequent development reveal that this is so (again because the coherent information
is convex). 

We can significantly strengthen the statement of the quantum capacity theorem for the
class of degradable quantum channels because the following inequality holds for them:

$$Q(\mathcal{N}^{\otimes n}) \leq n Q(\mathcal{N}). \tag{23.85}$$

This inequality follows from the additivity of coherent information for degradable channels
(Theorem 12.5.1). Also, the task of optimizing the coherent information for these channels
is straightforward because it is a concave function of the input density operator
(Theorem 12.5.2) and the set of density operators is convex.
23.6 Example Channels



We now show how to calculate the quantum capacity for two exemplary channels: the 
quantum erasure channel and the amplitude damping channel. Both of these channels are 
degradable, simplifying the calculation of their quantum capacities. 



23.6.1 The Quantum Erasure Channel 

Recall that the quantum erasure channel acts as follows on an input density operator $\rho^{A'}$:

$$\rho^{A'} \to (1 - \epsilon)\rho^{B} + \epsilon |e\rangle\langle e|^{B}, \tag{23.86}$$

where $\epsilon$ is the erasure probability and $|e\rangle^{B}$ is an erasure state that is orthogonal to the
support of any input state $\rho$.

Proposition 23.6.1. The quantum capacity of a quantum erasure channel with erasure
probability $\epsilon$ is

$$(1 - 2\epsilon) \log d_A, \tag{23.87}$$

where $d_A$ is the dimension of the input system.

Proof. To determine the quantum capacity of this channel, we need to compute its coherent
information, and we can do so in a similar way as we did in Proposition 20.6.1. So, consider
that sending half of a pure, bipartite state $\phi^{AA'}$ through the channel produces the output

$$\sigma^{AB} \equiv (1 - \epsilon)\phi^{AB} + \epsilon\, \phi^{A} \otimes |e\rangle\langle e|^{B}. \tag{23.88}$$

Recall that Bob can apply the following isometry $U^{B \to BX}$ to his state:

$$U^{B \to BX} \equiv \Pi^{B} \otimes |0\rangle^{X} + |e\rangle\langle e|^{B} \otimes |1\rangle^{X}, \tag{23.89}$$

where $\Pi^{B}$ is a projector onto the support of the input state (for qubits, it would be just
$|0\rangle\langle 0| + |1\rangle\langle 1|$). Applying this isometry leads to a state $\sigma^{ABX}$ where

\begin{align}
\sigma^{ABX} &\equiv U^{B \to BX} \sigma^{AB} (U^{B \to BX})^{\dagger} \tag{23.90} \\
&= (1 - \epsilon)\phi^{AB} \otimes |0\rangle\langle 0|^{X} + \epsilon\, \phi^{A} \otimes |e\rangle\langle e|^{B} \otimes |1\rangle\langle 1|^{X}. \tag{23.91}
\end{align}

The coherent information $I(A\rangle BX)_\sigma$ is equal to $I(A\rangle B)_\sigma$ because entropies do not change
under the isometry $U^{B \to BX}$. We now calculate $I(A\rangle BX)_\sigma$:

\begin{align}
I(A\rangle BX)_\sigma &= H(BX)_\sigma - H(ABX)_\sigma \tag{23.92} \\
&= H(B|X)_\sigma - H(AB|X)_\sigma \tag{23.93} \\
&= (1 - \epsilon)\left[ H(B)_\phi - H(AB)_\phi \right] + \epsilon \left[ H(B)_{|e\rangle} - H(AB)_{\phi^A \otimes |e\rangle\langle e|} \right] \tag{23.94} \\
&= (1 - \epsilon) H(B)_\phi - \epsilon \left[ H(A)_\phi + H(B)_{|e\rangle} \right] \tag{23.95} \\
&= (1 - 2\epsilon) H(A)_\phi \tag{23.96} \\
&\leq (1 - 2\epsilon) \log d_A. \tag{23.97}
\end{align}

The first equality follows by the definition of coherent information. The second equality
follows from $\phi^{A} = \operatorname{Tr}_{BX}\{\sigma^{ABX}\}$, from the chain rule of entropy, and by canceling $H(X)$ on
both sides. The third equality follows because the $X$ register is a classical register, indicating
whether the erasure occurs. The fourth equality follows because $H(AB)_\phi = 0$, $H(B)_{|e\rangle} = 0$,
and $H(AB)_{\phi^A \otimes |e\rangle\langle e|} = H(A)_\phi + H(B)_{|e\rangle}$. The fifth equality follows again because $H(B)_{|e\rangle} = 0$,
by collecting terms, and because $H(A)_\phi = H(B)_\phi$ ($\phi^{AB}$ is a pure bipartite state). The final
inequality follows because the entropy of a state on system $A$ is never greater than the
logarithm of the dimension of $A$. We can conclude that the maximally entangled state $\Phi^{AA'}$
achieves the quantum capacity of the quantum erasure channel because
$H(A)_\Phi = \log d_A$. □
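As a quick numerical companion to Proposition 23.6.1, the following sketch evaluates the formula in (23.87) in qubits per channel use. The function name `erasure_quantum_capacity` is hypothetical, and the clamp to zero for $\epsilon > 1/2$ is an assumption added here (a capacity cannot be negative, and the no-cloning argument of Section 23.2 already gives zero at $\epsilon = 1/2$); the proposition itself states only the expression $(1 - 2\epsilon) \log d_A$.

```python
import math


def erasure_quantum_capacity(eps, d=2):
    """Evaluate (1 - 2*eps) * log2(d) from (23.87), in qubits per channel use.

    The clamp at zero for eps > 1/2 is an assumption added in this sketch,
    since a capacity cannot be negative.
    """
    if not 0.0 <= eps <= 1.0:
        raise ValueError("erasure probability must lie in [0, 1]")
    return max(0.0, 1.0 - 2.0 * eps) * math.log2(d)


for eps in (0.0, 0.1, 0.25, 0.5, 0.8):
    print(eps, erasure_quantum_capacity(eps))
```

The value at $\epsilon = 1/2$ is zero, matching the no-cloning argument, and the capacity falls linearly in $\epsilon$ below that point.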

23.6.2 The Amplitude Damping Channel 

We now compute the quantum capacity of the amplitude damping channel $\mathcal{N}_{\text{AD}}$. Recall that
this channel acts as follows on an input qubit in state $\rho$:

$$\mathcal{N}_{\text{AD}}(\rho) = A_0 \rho A_0^{\dagger} + A_1 \rho A_1^{\dagger}, \tag{23.98}$$

where

$$A_0 \equiv |0\rangle\langle 0| + \sqrt{1 - \gamma}\, |1\rangle\langle 1|, \qquad A_1 \equiv \sqrt{\gamma}\, |0\rangle\langle 1|. \tag{23.99}$$

The development here is similar to the development in the proof of Proposition 20.6.2.

Proposition 23.6.2. The quantum capacity of an amplitude damping channel with damping
parameter $\gamma$ is

$$Q(\mathcal{N}_{\text{AD}}) = \max_{p \in [0,1]} H_2((1 - \gamma)p) - H_2(\gamma p), \tag{23.100}$$

whenever $\gamma \leq 1/2$. Otherwise, the quantum capacity is zero. Recall that $H_2(x)$ is the binary
entropy function.
Proof. Suppose that a matrix representation of the input qubit density operator $\rho$ in the
computational basis is

$$\rho = \begin{bmatrix} 1 - p & \eta^{*} \\ \eta & p \end{bmatrix}. \tag{23.101}$$

One can readily verify that the density operator for Bob has the following matrix
representation:

$$\mathcal{N}_{\text{AD}}(\rho) = \begin{bmatrix} 1 - (1 - \gamma)p & \sqrt{1 - \gamma}\, \eta^{*} \\ \sqrt{1 - \gamma}\, \eta & (1 - \gamma)p \end{bmatrix}, \tag{23.102}$$

and by calculating the elements $\operatorname{Tr}\{A_i \rho A_j^{\dagger}\} |i\rangle\langle j|$, we can obtain a matrix representation for
Eve's density operator:

$$\mathcal{N}_{\text{AD}}^{c}(\rho) = \begin{bmatrix} 1 - \gamma p & \sqrt{\gamma}\, \eta^{*} \\ \sqrt{\gamma}\, \eta & \gamma p \end{bmatrix}, \tag{23.103}$$

where $\mathcal{N}_{\text{AD}}^{c}$ is the complementary channel to Eve. By comparing (23.102) and (23.103), we
can see that the channel to Eve is an amplitude damping channel with damping parameter
$1 - \gamma$. The quantum capacity of $\mathcal{N}_{\text{AD}}$ is equal to its coherent information:


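$$Q(\mathcal{N}_{\text{AD}}) = \max_{\phi^{AA'}} I(A\rangle B)_\sigma, \tag{23.104}$$

where $\phi^{AA'}$ is some pure bipartite input state and $\sigma^{AB} = \mathcal{N}_{\text{AD}}(\phi^{AA'})$. We need to determine
the input density operator that maximizes the above formula as a function of $\gamma$. So far, the
optimization depends on three parameters: $p$, $\operatorname{Re}\{\eta\}$, and $\operatorname{Im}\{\eta\}$. We can show that it is
sufficient to consider an optimization over only $p$ with $\eta = 0$. The formula in (23.104) also
has the following form:

$$Q(\mathcal{N}_{\text{AD}}) = \max_{\rho} \left[ H(\mathcal{N}_{\text{AD}}(\rho)) - H(\mathcal{N}_{\text{AD}}^{c}(\rho)) \right], \tag{23.105}$$

because

\begin{align}
I(A\rangle B)_\sigma &= H(B)_\sigma - H(AB)_\sigma \tag{23.106} \\
&= H(\mathcal{N}_{\text{AD}}(\rho)) - H(E)_\sigma \tag{23.107} \\
&= H(\mathcal{N}_{\text{AD}}(\rho)) - H(\mathcal{N}_{\text{AD}}^{c}(\rho)) \tag{23.108} \\
&\equiv I_{\text{coh}}(\rho, \mathcal{N}_{\text{AD}}). \tag{23.109}
\end{align}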

The two entropies in (23.105) depend only on the eigenvalues of the two density operators
in (23.102)-(23.103), respectively, which are as follows:

$$\frac{1}{2}\left( 1 \pm \sqrt{(1 - 2(1 - \gamma)p)^2 + 4|\eta|^2 (1 - \gamma)} \right), \tag{23.110}$$

$$\frac{1}{2}\left( 1 \pm \sqrt{(1 - 2\gamma p)^2 + 4|\eta|^2 \gamma} \right). \tag{23.111}$$

The above eigenvalues are in the order of Bob and Eve. All of the above eigenvalues have a
similar form, and their dependence on $\eta$ is only through its magnitude. Thus, it suffices to



consider $\eta \in \mathbb{R}$ (this eliminates one parameter). Next, the eigenvalues do not change if we
flip the sign of $\eta$ (this is equivalent to rotating the original state $\rho$ by $Z$, to $Z\rho Z$), and thus,
the coherent information does not change as well:

$$I_{\text{coh}}(\rho, \mathcal{N}_{\text{AD}}) = I_{\text{coh}}(Z\rho Z, \mathcal{N}_{\text{AD}}). \tag{23.112}$$

By the above relation and concavity of coherent information in the input density operator
for degradable channels (Theorem 12.5.2), the following inequality holds

\begin{align}
I_{\text{coh}}(\rho, \mathcal{N}_{\text{AD}}) &= \frac{1}{2}\left[ I_{\text{coh}}(\rho, \mathcal{N}_{\text{AD}}) + I_{\text{coh}}(Z\rho Z, \mathcal{N}_{\text{AD}}) \right] \tag{23.113} \\
&\leq I_{\text{coh}}\left( \tfrac{1}{2}(\rho + Z\rho Z), \mathcal{N}_{\text{AD}} \right) \tag{23.114} \\
&= I_{\text{coh}}(\Delta(\rho), \mathcal{N}_{\text{AD}}), \tag{23.115}
\end{align}

where $\Delta$ is a completely dephasing channel in the computational basis. This demonstrates
that it is sufficient to consider diagonal density operators $\rho$ when optimizing the coherent
information. Thus, the eigenvalues in (23.110)-(23.111) respectively become

$$\{(1 - \gamma)p,\ 1 - (1 - \gamma)p\}, \tag{23.116}$$

$$\{\gamma p,\ 1 - \gamma p\}, \tag{23.117}$$

giving our final expression in the statement of the proposition. □
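The one-parameter optimization in (23.100) is easy to evaluate numerically. The sketch below uses a plain grid search over $p \in [0,1]$; the function names and the grid resolution are hypothetical choices for illustration, and a grid search only approximates the maximum from below.

```python
import math


def binary_entropy(x):
    """Binary entropy H_2(x) in bits, with H_2(0) = H_2(1) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)


def amplitude_damping_capacity(gamma, grid=10000):
    """Approximate max_p H_2((1-gamma) p) - H_2(gamma p) from (23.100).

    Returns zero when gamma >= 1/2, as in Proposition 23.6.2.
    """
    if gamma >= 0.5:
        return 0.0
    best = 0.0
    for i in range(grid + 1):
        p = i / grid
        value = binary_entropy((1.0 - gamma) * p) - binary_entropy(gamma * p)
        best = max(best, value)
    return best


for gamma in (0.0, 0.1, 0.25, 0.4, 0.5):
    print(gamma, round(amplitude_damping_capacity(gamma), 4))
```

At $\gamma = 0$ the channel is noiseless and the search returns one qubit per channel use (attained at $p = 1/2$); the value decreases as $\gamma$ grows and reaches zero at $\gamma = 1/2$, consistent with Exercise 23.2.1.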

Exercise 23.6.1 Consider the dephasing channel: $\rho \to (1 - p/2)\rho + (p/2) Z\rho Z$. Prove that
its quantum capacity is equal to $1 - H_2(p/2)$, where $p$ is the dephasing parameter.

23.7 Discussion of Quantum Capacity 

The quantum capacity is particularly well-behaved and understood for the class of degradable
channels. Thus, we should not expect any surprises for this class of channels. If a channel is
not degradable, we currently cannot say much about the exact value of its quantum capacity,
but the study of non-degradable channels has led to many surprises in quantum Shannon
theory, and this section discusses two of these surprises. The first is the superadditivity of
coherent information for the depolarizing channel, and the second is a striking phenomenon
known as superactivation of quantum capacity, where two channels that individually have
zero quantum capacity can combine to make a channel with non-zero quantum capacity.

23.7.1 Superadditivity of Coherent Information 

Recall that the depolarizing channel transmits its input with probability $1 - p$ and replaces
it with the maximally mixed state $\pi$ with probability $p$:

\begin{equation} \rho \to (1 - p)\rho + p\pi. \tag{23.118} \end{equation}

We focus on the case where the input and output of this channel are qubits. The depolarizing
channel is an example of a quantum channel that is not degradable (see footnote 2). As such, we might
expect it to exhibit some strange behavior with respect to its quantum capacity. Indeed, it
is known that its coherent information is strictly superadditive when the channel becomes
very noisy:

\begin{equation} 5\, Q(\mathcal{N}) < Q(\mathcal{N}^{\otimes 5}). \tag{23.119} \end{equation}

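How can we show that this result is true? First, we can calculate the coherent information
of this channel with respect to one channel use. It is possible to show that the maximally
entangled state $\Phi^{AA'}$ maximizes the channel coherent information $Q(\mathcal{N})$, and thus

\begin{align}
Q(\mathcal{N}) &= H(B)_{\Phi} - H(AB)_{\mathcal{N}(\Phi)} \tag{23.120} \\
&= 1 - H(AB)_{\mathcal{N}(\Phi)}, \tag{23.121}
\end{align}

where $H(B)_{\Phi} = 1$ follows because the output state on Bob's system is the maximally mixed
state whenever the input to the channel is half of a maximally entangled state. In order to
calculate $H(AB)_{\mathcal{N}(\Phi)}$, observe that the state on $AB$ is

\begin{align}
(1 - p)\Phi^{AB} + p\,\pi^{A} \otimes \pi^{B} &= (1 - p)\Phi^{AB} + \frac{p}{4} I^{AB} \tag{23.122} \\
&= (1 - p)\Phi^{AB} + \frac{p}{4}\left( \left[ I^{AB} - \Phi^{AB} \right] + \Phi^{AB} \right) \tag{23.123} \\
&= \left(1 - \frac{3p}{4}\right)\Phi^{AB} + \frac{p}{4}\left( I^{AB} - \Phi^{AB} \right). \tag{23.124}
\end{align}

Since $\Phi^{AB}$ and $I^{AB} - \Phi^{AB}$ are orthogonal, the eigenvalues of this state are $1 - 3p/4$ with
multiplicity one and $p/4$ with multiplicity three. Thus, the entropy $H(AB)_{\mathcal{N}(\Phi)}$ is

\begin{equation} H(AB)_{\mathcal{N}(\Phi)} = -\left(1 - \frac{3p}{4}\right)\log\left(1 - \frac{3p}{4}\right) - \frac{3p}{4}\log\left(\frac{p}{4}\right), \tag{23.125} \end{equation}

and our final expression for the one-shot coherent information is

\begin{equation} Q(\mathcal{N}) = 1 + \left(1 - \frac{3p}{4}\right)\log\left(1 - \frac{3p}{4}\right) + \frac{3p}{4}\log\left(\frac{p}{4}\right). \tag{23.126} \end{equation}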
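
The expression in (23.126) is simple enough to evaluate directly. The sketch below is illustrative only: it evaluates the right-hand side of (23.126) with base-2 logarithms and uses bisection to locate the value of $p$ at which the expression changes sign. The function names and the bracketing interval $[0.2, 0.3]$ are our own assumptions.

```python
import math


def one_shot_coherent_info(p: float) -> float:
    """Right-hand side of (23.126) in bits, for 0 <= p <= 1."""
    a = 1.0 - 3.0 * p / 4.0  # eigenvalue with multiplicity one
    b = p / 4.0              # eigenvalue with multiplicity three
    total = 1.0
    if a > 0.0:
        total += a * math.log2(a)
    if b > 0.0:
        total += 3.0 * b * math.log2(b)
    return total


def zero_crossing(lo: float = 0.2, hi: float = 0.3, iters: int = 60) -> float:
    """Bisection for the p at which (23.126) changes sign.

    Assumes the expression is positive at `lo` and negative at `hi`.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if one_shot_coherent_info(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Calling `zero_crossing()` returns a value of $p$ close to 0.25, which is the noise level at which the one-shot rate in (23.126) stops being positive.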

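Another strategy for transmitting quantum data is to encode half of the maximally
entangled state with a five-qubit repetition code:

\begin{equation} \frac{1}{\sqrt{2}}\left( |00\rangle^{AA_1} + |11\rangle^{AA_1} \right) \to \frac{1}{\sqrt{2}}\left( |000000\rangle^{AA_1A_2A_3A_4A_5} + |111111\rangle^{AA_1A_2A_3A_4A_5} \right), \tag{23.127} \end{equation}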



2 Ref. [232] gives an explicit condition that determines whether a channel is degradable.

[Figure 23.3: two panels, "Linear Scale" (left) and "Logarithmic Scale" (right), each plotted against the depolarizing parameter.]

Figure 23.3: The figures plot the coherent information in (23.126) (in green) and that in (23.128) (in blue)
versus the depolarizing noise parameter $p$. The figure on the left is on a linear scale, and the one on the
right is on a logarithmic scale. The notable features of the figure on the left are that the quantum data rate
of the green curve is equal to one and the quantum data rate of the blue curve is 1/5 when the channel is
noiseless (the latter rate is to be expected for a five-qubit repetition code). Both data rates become small
when $p$ is near 0.25, but the figure on the right reveals that the repetition code concatenation strategy still
gets positive coherent information even when the rate of the random coding strategy vanishes. This is an
example of a channel for which the coherent information can be superadditive.



and calculate the following coherent information with respect to the state resulting from
sending the systems $A_1 \cdots A_5$ through the channel:

\begin{equation} \frac{1}{5} I(A \rangle B_1 B_2 B_3 B_4 B_5). \tag{23.128} \end{equation}

(We normalize the above coherent information by five in order to make a fair comparison
between a code achieving this rate and one achieving the rate in (23.126).) We know that the
rate in (23.128) is achievable by applying the direct part of the quantum capacity theorem to
the channel $\mathcal{N}^{\otimes 5}$, and operationally, this strategy amounts to concatenating a random
quantum code with a five-qubit repetition code. The remarkable result is that this concatenation
strategy can beat the one-shot coherent information when the channel becomes very noisy.
Figure 23.3 demonstrates that the concatenation strategy has positive coherent information
even when the one-shot coherent information in (23.126) vanishes. This demonstrates
superadditivity of coherent information.

Why does this phenomenon occur? The simplest (though perhaps not completely satisfying)
explanation is that it results from a phenomenon known as degeneracy. Consider a
qubit $\alpha|0\rangle + \beta|1\rangle$ encoded in a repetition code:

\begin{equation} \alpha|00000\rangle + \beta|11111\rangle. \tag{23.129} \end{equation}

If the "error" $Z_1 \otimes Z_2$ occurs, then it actually has no effect on this state. The same holds
for other two-qubit combinations of $Z$ errors. When the channel noise is low, degeneracy of
the code with respect to these errors does not help very much because these two-qubit error
combinations are less likely to occur. Though, when the channel becomes really noisy, these
errors are more likely to occur, and the help from degeneracy of the repetition code offsets
the loss in rate.
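
The degeneracy claim is easy to check by hand, and the following minimal sketch does the same bookkeeping in code. It represents the encoded state in (23.129) as a map from computational-basis bit strings to amplitudes and applies Pauli $Z$ to chosen qubits. The representation and the function names are our own illustrative choices, and the qubits are indexed from zero.

```python
from typing import Dict, Iterable

# A state is a map from computational-basis bit strings to amplitudes.
State = Dict[str, complex]


def repetition_encode(alpha: complex, beta: complex, n: int = 5) -> State:
    """Encode alpha|0> + beta|1> as alpha|0...0> + beta|1...1> on n qubits."""
    return {"0" * n: alpha, "1" * n: beta}


def apply_z(state: State, qubits: Iterable[int]) -> State:
    """Apply Pauli Z to each listed qubit (0-indexed).

    A basis state picks up a factor of -1 for every listed qubit that is 1.
    """
    targets = list(qubits)
    out: State = {}
    for bits, amp in state.items():
        sign = 1
        for q in targets:
            if bits[q] == "1":
                sign = -sign
        out[bits] = sign * amp
    return out
```

For example, `apply_z(repetition_encode(0.6, 0.8), [0, 1])` returns the encoded state unchanged, whereas a single $Z$ error, `apply_z(repetition_encode(0.6, 0.8), [0])`, flips the sign of the second amplitude.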

It is perhaps strange that the coherent information of a depolarizing channel behaves in 
this way. The channel seems simple enough, and we could say that the strategies for achieving 
the unassisted and entanglement-assisted classical capacity of this channel are very "classical" 
strategies. Recall that the best strategy for achieving the unassisted classical capacity is to 
generate random codes by picking states uniformly at random from some orthonormal basis, 
and the receiver measures each channel output in this same orthonormal basis. For achieving 
the entanglement-assisted classical capacity, we choose a random code by picking Bell states 
uniformly at random and the receiver measures each channel output and his half of each 
entangled state in the Bell basis. Both of these results follow from the additivity of the 
respective capacities. In spite of these other results, the best strategy for achieving the 
quantum capacity of the depolarizing channel remains very poorly understood. 



23.7.2 Superactivation of Quantum Capacity 

Perhaps the most startling result in quantum communication is a phenomenon known as
superactivation. Suppose that Alice is connected to Bob by a quantum channel $\mathcal{N}_1$ with zero
capacity for transmitting quantum data. Also, suppose that there is some other zero quantum
capacity channel $\mathcal{N}_2$ connecting them. Intuitively, we would expect that Alice should not be
able to transmit quantum data reliably over the tensor-product channel $\mathcal{N}_1 \otimes \mathcal{N}_2$. That is,
using these channels in parallel seems like it should not give any advantage over using the
individual channels alone if they are both individually useless for quantum data transmission
(this is the intuition that we have whenever a capacity formula is additive). But two examples
of zero-capacity channels are known that can superactivate each other, such that the joint
channel has a non-zero quantum capacity. How is this possible?

First, consider a 50% quantum erasure channel $\mathcal{N}_1$ that transmits its input state with
probability 1/2 and replaces it with an erasure state with probability 1/2. As we have argued
before with the no-cloning theorem, such a channel has zero capacity for sending quantum
data reliably. Now consider some other channel $\mathcal{N}_2$. We argue in the proof of the following
theorem that the coherent information of the joint channel $\mathcal{N}_1 \otimes \mathcal{N}_2$ is equal to half the
private information of $\mathcal{N}_2$ alone.

Theorem 23.7.1. Let $\{p_X(x), \rho_x^{A_2}\}$ be an ensemble of inputs for the channel $\mathcal{N}_2$, and let $\mathcal{N}_1$
be a 50% erasure channel. Then there exists a state $\varphi^{A_1A_2}$ such that the coherent information
$H(B_1B_2) - H(E_1E_2)$ of the joint channel is equal to half the private information $I(X; B_2) -
I(X; E_2)$ of the second channel:

\begin{equation} H(B_1B_2)_{\omega} - H(E_1E_2)_{\omega} = \frac{1}{2}\left[ I(X; B_2)_{\rho} - I(X; E_2)_{\rho} \right], \tag{23.130} \end{equation}

where

\begin{align}
\omega^{B_1B_2E_1E_2} &\equiv (U_{\mathcal{N}_1} \otimes U_{\mathcal{N}_2})(\varphi^{A_1A_2}), \tag{23.131} \\
\rho^{XB_2E_2} &\equiv \sum_x p_X(x)\, |x\rangle\langle x|^{X} \otimes U_{\mathcal{N}_2}(\rho_x^{A_2}), \tag{23.132}
\end{align}

and $U_{\mathcal{N}_1}$ and $U_{\mathcal{N}_2}$ are the respective isometric extensions of $\mathcal{N}_1$ and $\mathcal{N}_2$.

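Proof. Consider the following classical-quantum state corresponding to the ensemble $\{p_X(x), \rho_x^{A_2}\}$:

\begin{equation} \rho^{XA_2} \equiv \sum_x p_X(x)\, |x\rangle\langle x|^{X} \otimes \rho_x^{A_2}. \tag{23.133} \end{equation}

A purification of this state is

\begin{equation} |\varphi\rangle^{XA_1A_2} \equiv \sum_x \sqrt{p_X(x)}\, |x\rangle^{X} \left( |x\rangle |\phi_x\rangle \right)^{A_1A_2}, \tag{23.134} \end{equation}

where each $|\phi_x\rangle$ is a purification of $\rho_x^{A_2}$ (so that $(|x\rangle|\phi_x\rangle)^{A_1A_2}$ is a purification as well). Let
$|\varphi\rangle^{XB_1E_1B_2E_2}$ be the state resulting from sending $A_1$ and $A_2$ through the tensor product
channel $U_{\mathcal{N}_1} \otimes U_{\mathcal{N}_2}$. We can write this state as follows by recalling the isometric extension
of the erasure channel in (23.7):

\begin{align}
|\varphi\rangle^{XB_1E_1B_2E_2} \equiv{} & \frac{1}{\sqrt{2}} \sum_x \sqrt{p_X(x)}\, |x\rangle^{X} \left( |x\rangle |\phi_x\rangle \right)^{B_1B_2E_2} |e\rangle^{E_1} \nonumber \\
& + \frac{1}{\sqrt{2}} \sum_x \sqrt{p_X(x)}\, |x\rangle^{X} \left( |x\rangle |\phi_x\rangle \right)^{E_1B_2E_2} |e\rangle^{B_1}. \tag{23.135}
\end{align}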



Recall that Bob can perform an isometry on $B_1$ of the form in (23.89) that identifies whether
he receives the state or the erasure symbol, and let $Z_B$ be the classical flag indicating the
outcome. Eve can do the same, and let $Z_E$ indicate her flag. Then we can evaluate the
coherent information of the state resulting from sending system $A_1$ through the erasure
channel and $A_2$ through the other channel $\mathcal{N}_2$:

\begin{align}
H(B_1B_2) - H(E_1E_2) &= H(B_1Z_BB_2) - H(E_1Z_EE_2) \tag{23.136} \\
&= H(B_1B_2|Z_B) + H(Z_B) - H(E_1E_2|Z_E) - H(Z_E) \tag{23.137} \\
&= H(B_1B_2|Z_B) - H(E_1E_2|Z_E) \tag{23.138} \\
&= \frac{1}{2}\left[ H(B_2) + H(A_1B_2) \right] - \frac{1}{2}\left[ H(A_1E_2) + H(E_2) \right] \tag{23.139} \\
&= \frac{1}{2}\left[ H(B_2) + H(XE_2) \right] - \frac{1}{2}\left[ H(XB_2) + H(E_2) \right] \tag{23.140} \\
&= \frac{1}{2}\left[ I(X; B_2) - I(X; E_2) \right]. \tag{23.141}
\end{align}

The first equality follows because Bob and Eve can perform the isometries that identify
whether they receive the state or the erasure flag. The second equality follows from the
chaining rule for entropy, and the third follows because the entropies of the flag registers $Z_B$
and $Z_E$ are equal for a 50% erasure channel. The fourth equality follows because the registers
$Z_B$ and $Z_E$ are classical, and we can evaluate the conditional entropies as a uniform convex
sum of different possibilities: Bob obtaining the state transmitted or not, and Eve obtaining
the state transmitted or not. The fifth equality follows because the state on $A_1B_2XE_2$ is
pure when conditioning on Bob getting the output of the erasure channel, and the same
holds for when Eve gets the output of the erasure channel. The final equality follows from
adding and subtracting $H(X)$ and from the definition of quantum mutual information. □
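
For readers who wish to see the final step spelled out, the following short derivation expands "adding and subtracting $H(X)$". It uses only the definition $I(X;B) = H(X) + H(B) - H(XB)$ and introduces no new assumptions.

```latex
\begin{align*}
\tfrac{1}{2}\left[ H(B_2) + H(XE_2) \right] - \tfrac{1}{2}\left[ H(XB_2) + H(E_2) \right]
&= \tfrac{1}{2}\left[ H(X) + H(B_2) - H(XB_2) \right]
 - \tfrac{1}{2}\left[ H(X) + H(E_2) - H(XE_2) \right] \\
&= \tfrac{1}{2}\left[ I(X; B_2) - I(X; E_2) \right].
\end{align*}
```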

Armed with the above theorem, we need to find an example of a quantum channel that
has zero quantum capacity, but for which there exists an ensemble that registers a non-zero
private information. If such a channel were to exist, we could combine it with a 50%
erasure channel in order to achieve a non-zero coherent information (and thus a non-zero
quantum capacity) for the joint channel. Indeed, such a channel exists, and it is known as an
entanglement-binding channel. It has the ability to generate private classical communication
but no ability to transmit quantum information (we point the reader to Refs. [151, 153] for
further details on these channels). Thus, the 50% erasure channel and the entanglement-binding
channel can superactivate each other.

The startling phenomenon of superactivation has important implications for quantum 
data transmission. First, it implies that a quantum channel's ability to transmit quantum 
information depends on the context in which it is used. For example, if other seemingly 
useless channels are available, it could be possible to transmit more quantum information 
than would be possible were the channels used alone. Next, and more importantly for 
quantum Shannon theory, it implies that whatever formula might eventually be found to
characterize quantum capacity (some characterization other than the regularized coherent
information in Theorem 23.3.1), it should be strongly non-additive in some cases (strongly
non-additive in the sense of superactivation). That is, suppose that $Q^{?}(\mathcal{N})$ is some unknown
formula for the quantum capacity of $\mathcal{N}$ and $Q^{?}(\mathcal{M})$ is the same formula characterizing the
quantum capacity of $\mathcal{M}$. Then this formula in general should be strongly non-additive in
some cases:

\begin{equation} Q^{?}(\mathcal{N} \otimes \mathcal{M}) > Q^{?}(\mathcal{N}) + Q^{?}(\mathcal{M}). \tag{23.142} \end{equation}

The discovery of superactivation has led us to realize that at present we are much farther 
than we might have thought from understanding reliable communication rates over quantum 
channels. 



23.8 Entanglement Distillation 



We close out this chapter with a final application of the techniques in the direct coding part
of Theorem 23.3.1 to the task of entanglement distillation. Entanglement distillation is a
protocol where Alice and Bob begin with many copies of some bipartite state $\rho^{AB}$. They
attempt to distill ebits from it at some positive rate by employing local operations and
forward classical communication from Alice to Bob. If the state is pure, then Alice and Bob
should simply perform the entanglement concentration protocol from Chapter 18, and there
is no need for forward classical communication in this case. Otherwise, they can perform the
protocol given in the proof of the following theorem.

Theorem 23.8.1 (Devetak-Winter). Suppose that Alice and Bob share the state $(\rho^{AB})^{\otimes n}$,
where $n$ is an arbitrarily large number. Then it is possible for them to distill ebits at the rate
$I(A \rangle B)$ if they are allowed forward classical communication from Alice to Bob.

We should mention that we have already proved the statement in the above theorem
with the protocol given in Corollary 21.4.2. Nevertheless, it is still instructive to exploit
the techniques from this chapter in proving the existence of an entanglement distillation
protocol.

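Proof. Suppose that Alice and Bob begin with a general bipartite state $\rho^{AB}$ with purification
$\psi^{ABE}$. We can write the purification in Schmidt form as follows:

\begin{equation} |\psi\rangle^{ABE} \equiv \sum_{x \in \mathcal{X}} \sqrt{p_X(x)}\, |x\rangle^{A} \otimes |\psi_x\rangle^{BE}. \tag{23.143} \end{equation}

The $n$th extension of the above state is

\begin{equation} |\psi\rangle^{A^nB^nE^n} \equiv \sum_{x^n \in \mathcal{X}^n} \sqrt{p_{X^n}(x^n)}\, |x^n\rangle^{A^n} \otimes |\psi_{x^n}\rangle^{B^nE^n}. \tag{23.144} \end{equation}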

The protocol begins with Alice performing a type class measurement given by the type
projectors (recall from (14.118) that the typical projector decomposes into a sum of the type
class projectors):

\begin{equation} \Pi_t^{n} \equiv \sum_{x^n \in T_t^{X^n}} |x^n\rangle\langle x^n|. \tag{23.145} \end{equation}

If the type resulting from the measurement is not a typical type, then Alice aborts the
protocol (this result happens with arbitrarily small probability). If it is a typical type, they
can then consider a code over a particular type class $t$ with the following structure:

\begin{align}
LMK &\approx |T_t| \approx 2^{nH(X)}, \tag{23.146} \\
K &\approx 2^{nI(X;E)}, \tag{23.147} \\
MK &\approx 2^{nI(X;B)}, \tag{23.148}
\end{align}

where $t$ is the type class and the entropies are with respect to the following dephased state:

\begin{equation} \sum_{x \in \mathcal{X}} p_X(x)\, |x\rangle\langle x|^{X} \otimes |\psi_x\rangle\langle\psi_x|^{BE}. \tag{23.149} \end{equation}

It follows that $M \approx 2^{n[I(X;B) - I(X;E)]} = 2^{n[H(B) - H(E)]}$ and $L \approx 2^{nH(X|B)}$. We label the
codewords as $x^n(l,m,k)$ where $x^n(l,m,k) \in T_t$. Thus, we instead operate on the following
state $|\tilde{\psi}_t\rangle^{A^nB^nE^n}$ resulting from the type class measurement:

\begin{equation} |\tilde{\psi}_t\rangle^{A^nB^nE^n} \equiv \frac{1}{\sqrt{|T_t|}} \sum_{x^n \in T_t} |x^n\rangle^{A^n} \otimes |\psi_{x^n}\rangle^{B^nE^n}. \tag{23.150} \end{equation}
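
The bookkeeping behind the code structure in (23.146)-(23.148) can be sketched in a few lines. The function below is a hypothetical helper, not part of the protocol: it takes the three entropic quantities as plain numbers and returns the per-copy exponents $(\log L)/n$, $(\log M)/n$, and $(\log K)/n$ obtained by dividing the approximate relations above. The sample values in the usage note are invented for illustration.

```python
from typing import Tuple


def code_rate_exponents(h_x: float, i_xb: float, i_xe: float) -> Tuple[float, float, float]:
    """Per-copy exponents (log L, log M, log K) / n implied by (23.146)-(23.148).

    h_x  : H(X)
    i_xb : I(X;B)
    i_xe : I(X;E)
    """
    log_k = i_xe          # from K ~ 2^{n I(X;E)}
    log_m = i_xb - i_xe   # from MK ~ 2^{n I(X;B)}
    log_l = h_x - i_xb    # from LMK ~ 2^{n H(X)}, i.e. H(X|B)
    return log_l, log_m, log_k
```

With the made-up values $H(X) = 1$, $I(X;B) = 0.75$, and $I(X;E) = 0.25$, the call `code_rate_exponents(1.0, 0.75, 0.25)` gives exponents that sum to $H(X)$, with the middle entry being the ebit rate $I(X;B) - I(X;E)$.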

The protocol proceeds as follows. Alice first performs the following incomplete measurement
of the system $A^n$:

\begin{equation} \left\{ \Gamma_l \equiv \sum_{m,k} |m,k\rangle\langle x^n(l,m,k)|^{A^n} \right\}_l. \tag{23.151} \end{equation}

This measurement collapses the above state as follows:

\begin{equation} \frac{1}{\sqrt{MK}} \sum_{m,k} |m,k\rangle^{A^n} \otimes |\psi_{x^n(l,m,k)}\rangle^{B^nE^n}. \tag{23.152} \end{equation}

Alice transmits the classical information in $l$ to Bob, using $nH(X|B)$ bits of classical
information. Bob needs to know $l$ so that he can know in which code they are operating.



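Bob then constructs the following isometry, a coherent POVM similar to that in (23.30)
(constructed from the POVM for a private classical communication code):

\begin{equation} \sum_{m,k} \sqrt{\Lambda_{m,k}^{B^n}} \otimes |m,k\rangle^{B}. \tag{23.153} \end{equation}

After performing the above coherent POVM, his state is close to the following one:

\begin{equation} \frac{1}{\sqrt{MK}} \sum_{m,k} |m,k\rangle^{A^n} \otimes |m,k\rangle^{B} |\psi_{x^n(l,m,k)}\rangle^{B^nE^n}. \tag{23.154} \end{equation}

Alice then performs a measurement of the $k$ register in the Fourier-transformed basis:

\begin{equation} \left\{ |\tilde{t}\rangle \equiv \frac{1}{\sqrt{K}} \sum_{k} e^{i2\pi tk/K} |k\rangle \right\}_{t \in \{1, \ldots, K\}}. \tag{23.155} \end{equation}

Alice performs this particular measurement because she would like Bob and Eve to maintain
their entanglement in the $k$ variable. The state resulting from this measurement is

\begin{equation} \frac{1}{\sqrt{MK}} \sum_{m,k} |m\rangle^{A^n} \otimes e^{i2\pi tk/K} |m,k\rangle^{B} |\psi_{x^n(l,m,k)}\rangle^{B^nE^n}. \tag{23.156} \end{equation}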

Alice then uses $nI(X; E)$ bits to communicate the $t$ variable to Bob. Bob then applies the
phase transformation $Z^{\dagger}(t)$, where

\begin{equation} Z^{\dagger}(t) = \sum_{k} e^{-i2\pi tk/K} |k\rangle\langle k|, \tag{23.157} \end{equation}

to his $k$ variable in register $B$. The resulting state is

\begin{equation} \frac{1}{\sqrt{MK}} \sum_{m,k} |m\rangle^{A^n} \otimes |m,k\rangle^{B} |\psi_{x^n(l,m,k)}\rangle^{B^nE^n}. \tag{23.158} \end{equation}
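
The effect of Bob's correction is that the phase left behind by Alice's Fourier-basis measurement cancels term by term. The following small sketch checks this cancellation numerically. It assumes the sign convention in which the post-measurement phase is $e^{i2\pi tk/K}$ and the correction is $e^{-i2\pi tk/K}$, and the function names are our own.

```python
import cmath
from typing import List


def measurement_phase(t: int, k: int, K: int) -> complex:
    """Phase attached to |k> after Alice obtains outcome t (assumed sign convention)."""
    return cmath.exp(2j * cmath.pi * t * k / K)


def correction_phase(t: int, k: int, K: int) -> complex:
    """Diagonal entry of Bob's phase correction acting on |k>."""
    return cmath.exp(-2j * cmath.pi * t * k / K)


def residual_phases(t: int, K: int) -> List[complex]:
    """Net phase on each |k>, k = 1..K, after the correction is applied."""
    return [measurement_phase(t, k, K) * correction_phase(t, k, K) for k in range(1, K + 1)]
```

For any outcome `t`, every entry of `residual_phases(t, K)` equals one up to floating-point error, which is why the corrected state no longer depends on $t$.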



They then proceed as in the final steps (23.47)-(23.54) of the protocol from the direct coding
part of Theorem 23.3.1, and they extract a state close to a maximally entangled state of the
following form:

\begin{equation} \frac{1}{\sqrt{M}} \sum_{m} |m\rangle^{A^n} \otimes |m\rangle^{B}, \tag{23.159} \end{equation}

with rate equal to $(\log M)/n = H(B) - H(E)$. □

Exercise 23.8.1 Argue that the above protocol cannot perform the task of state transfer
as can the protocol in Corollary 21.4.2.



23.9 History and Further Reading 



The quantum capacity theorem has a long history that led to many important discoveries in
quantum information theory. Shor first stated the problem of finding the quantum capacity
of a quantum channel in his seminal paper on quantum error correction. DiVincenzo et
al. demonstrated that the coherent information of the depolarizing channel is superadditive
by concatenating a random code with a repetition code [ED] (this result in hindsight was
remarkable given that the coherent information was not even known at the time). Smith and
Smolin later extended this result to show that the coherent information is strongly
superadditive for several examples of Pauli channels [232]. Schumacher and Nielsen demonstrated
that the coherent information obeys a quantum data processing inequality [218], much like
the classical data processing inequality for mutual information. Schumacher and Westmoreland
started making connections between quantum privacy and quantum coherence [220].
Bennett et al. [30] and Barnum et al. [15] demonstrated that forward classical communication
cannot increase the quantum capacity. In the same paper, Bennett et al. [30] introduced
the idea of entanglement distillation, which has important connections with the quantum
capacity.

Barnum, Knill, Nielsen, and Schumacher made important progress on the quantum capacity
theorem in a series of papers that established the coherent information upper bound
on the quantum capacity [217, 218, 16, 15]. Lloyd [185], Shor [227], and Devetak [68] are
generally credited with proving the coherent information lower bound on the quantum capacity,
though an inspection of Lloyd's proof reveals that it is perhaps not as rigorous as the latter
two proofs. Shor delivered his proof of the lower bound in a lecture [227], though he never
published this proof in a journal. Later, Hayden, Shor, and Winter published a paper [134]
detailing a proof of the quantum capacity theorem that they considered to be closest in spirit
to Shor's proof in Ref. [227]. After Shor's proof, Devetak provided a fully rigorous proof of
the lower bound on the quantum capacity [68], by analyzing superpositions of the codewords
from private classical codes. This is the approach we have taken in this chapter. We should
also mention that Hamada showed how to achieve the coherent information for certain input
states by using random stabilizer codes [121], and Harrington and Preskill showed how to
achieve the coherent information rate for a very specific class of channels [122].

Another approach to proving the quantum capacity theorem is known as the decoupling
approach [132]. This approach exploits a fundamental concept introduced by Schumacher
and Westmoreland in Ref. [221]. Suppose that the reference, Bob, and Eve share a tripartite
pure entangled state $|\psi\rangle^{RBE}$ after Alice transmits her share of the entanglement with the
reference through a noisy channel. Then if the reduced state $\psi^{RE}$ on the reference system
and Eve's system is approximately decoupled, meaning that

\begin{equation} \left\| \psi^{RE} - \psi^{R} \otimes \sigma^{E} \right\|_1 \leq \epsilon, \tag{23.160} \end{equation}

where $\sigma^{E}$ is some arbitrary state, this implies that Bob can decode the quantum information
that Alice intended to send to him. Why is this so? Let's suppose that the state is exactly
decoupled. Then one purification of the state $\psi^{RE}$ is the state $|\psi\rangle^{RBE}$ that they share after
the channel acts. Another purification of $\psi^{RE} = \psi^{R} \otimes \sigma^{E}$ is

\begin{equation} |\psi\rangle^{RB_1} \otimes |\sigma\rangle^{B_2E}, \tag{23.161} \end{equation}

where $|\psi\rangle^{RB_1}$ is the original state that Alice sent through the channel and $|\sigma\rangle^{B_2E}$ is some
other state that purifies the state $\sigma^{E}$ of the environment. Since all purifications are related
by isometries and since Bob possesses the purification of $R$ and $E$, there exists some unitary
$U^{B \to B_1B_2}$ such that

\begin{equation} U^{B \to B_1B_2} |\psi\rangle^{RBE} = |\psi\rangle^{RB_1} \otimes |\sigma\rangle^{B_2E}. \tag{23.162} \end{equation}

This unitary is then Bob's decoder! Thus, the decoupling condition implies the existence 
of a decoder for Bob, so that it is only necessary to show the existence of an encoder 
that decouples the reference from the environment. Simply put, the structure of quantum 
mechanics allows for this way of proving the quantum capacity theorem. 

Many researchers have now exploited the decoupling approach in a variety of contexts.
This approach is implicit in Devetak's proof of the quantum capacity theorem [68]. Horodecki
et al. exploited it to prove the existence of a state merging protocol [148, 149]. Yard and
Devetak [267] and Ye et al. [271] used it in their proofs of the state redistribution protocol.
Dupuis et al. [EI] proved the best known characterization of the entanglement-assisted
quantum capacity of the broadcast channel with this approach. The thesis of Dupuis and
subsequent work generalize this decoupling approach to settings beyond the traditional IID
setting [521 E3]. Datta and coworkers have also applied this approach in a variety of
contexts [501 IB"3| [62], and Ref. [251] used the approach to study quantum communication with
a noisy channel and a noisy state.

Bennett et al. found the quantum capacity of the erasure channel in Ref. [29], and
Fazio and Giovannetti computed the quantum capacity of the amplitude damping channel in
Ref. [102]. Smith et al. showed superactivation in Ref. [234] and later showed superactivation
for channels that can be realized more easily in the laboratory [233]. Devetak and Winter
established that the coherent information is achievable for entanglement distillation [76].

CHAPTER 24



Trading Resources for Communication 



This chapter unifies all of the channel coding theorems that we have studied in this book. 
One of the most general information processing tasks that a sender and receiver can accom- 
plish is to transmit classical and quantum information and generate entanglement with many 
independent uses of a quantum channel and with the assistance of classical communication, 
quantum communication, and shared entanglement (see footnote 1). The resulting rates for
communication are net rates that give the generation rate of a resource less its consumption rate.
Since we have three resources, all achievable rates are rate triples (C, Q, E) that lie in a 
three-dimensional capacity region, where C is the net rate of classical communication, Q is 
the net rate of quantum communication, and E is the net rate of entanglement consump- 
tion/generation. The capacity theorem for this general scenario is known as the quantum 
dynamic capacity theorem, and it is the main theorem that we prove in this chapter. All of 
the rates given in the channel coding theorems of previous chapters are special points in this 
three-dimensional capacity region. 

The proof of the quantum dynamic capacity theorem comes in two parts: the direct
coding theorem and the converse theorem. The direct coding theorem demonstrates that
the strategy for achieving any point in the three-dimensional capacity region is remarkably
simple: we just combine the protocol from Corollary 21.5.2 for entanglement-assisted classical
and quantum communication with the three unit protocols of teleportation, super-dense
coding, and entanglement distribution. The interpretation of the achievable rate region
is that it is the unit resource capacity region from Chapter 8 translated along the points
achievable with the protocol from Corollary 21.5.2. The proof of the converse theorem is



perhaps the more difficult part: we analyze the most general protocol that can consume and
generate classical communication, quantum communication, and entanglement along with
the consumption of many independent uses of a quantum channel, and we show that the net
rates for such a protocol are bounded by the achievable rate region. In the general case, our
characterization is multi-letter, meaning that the computation of the capacity region requires
an optimization over a potentially infinite number of channel uses and is thus intractable.
Though, the quantum Hadamard channels from Section 5.2.4 are a special class of channels
for which the regularization is not necessary, and we can compute their capacity regions over
a single instance of the channel. Another important class of channels for which the capacity
region is known is the class of lossy bosonic channels (though the optimality proof is only
up to a long-standing conjecture which many researchers believe to be true). These lossy
bosonic channels model free-space communication or loss in a fiber optic cable and thus have
an elevated impetus for study because of their importance in practical applications.

1 Recall that Chapter 8 addressed a special case of this information processing task that applies to the
scenario in which the sender and receiver do not have access to many independent uses of a noisy quantum
channel.

One of the most important questions for communication in this three-dimensional setting
is whether it is really necessary to exploit the trade-off coding strategy given in
Corollary 21.5.2. That is, would it be best simply to use a classical communication code for
a fraction of the channel uses, a quantum communication code for another fraction, an
entanglement-assisted code for another fraction, etc.? Such a strategy is known as
time-sharing and allows the sender and receiver to achieve convex combinations of any rate triples
in the capacity region. The answer to this question depends on the channel. For example,
time-sharing is optimal for the quantum erasure channel, but it is not for a dephasing
channel or a lossy bosonic channel. In fact, trade-off coding for a lossy bosonic channel can
give tremendous performance gains over time-sharing. How can we know which one will
perform better in the general case? It is hard to say, but at the very least, we know that
time-sharing is a special case of trade-off coding as we argued in Section 21.5.2. Thus, from
this perspective, it might make sense simply to always use a trade-off strategy.
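
To make the notion of time-sharing concrete, the following minimal sketch forms the convex combination of two rate triples $(C, Q, E)$. The function name and the sample triples in the usage note are hypothetical; the sketch only illustrates the arithmetic of splitting the channel uses between two codes and says nothing about which triples a given channel can actually achieve.

```python
from typing import Tuple

# A rate triple (C, Q, E): net classical, quantum, and entanglement rates.
RateTriple = Tuple[float, float, float]


def time_share(first: RateTriple, second: RateTriple, fraction: float) -> RateTriple:
    """Rate triple obtained by using `first` for `fraction` of the channel uses
    and `second` for the remaining uses."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    c, q, e = (fraction * a + (1.0 - fraction) * b for a, b in zip(first, second))
    return (c, q, e)
```

For instance, splitting the channel uses evenly between a purely classical code with triple `(1.0, 0.0, 0.0)` and a purely quantum code with triple `(0.0, 1.0, 0.0)` gives `time_share((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), 0.5)`, that is, half of each rate.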

We organize this chapter as follows. We first review the information processing task
corresponding to the quantum dynamic capacity region. Section 24.2 states the quantum
dynamic capacity theorem and shows how many of the capacity theorems we studied
previously arise as special cases of it. The next two sections prove the direct coding theorem
and the converse theorem. Section 24.4.2 introduces the quantum dynamic capacity
formula, which is important for analyzing whether the quantum dynamic capacity region is
single-letter. In the final section of this chapter, we compute and plot the quantum dynamic
capacity region for the dephasing channels and the lossy bosonic channels.

24.1 The Information Processing Task 



Figure 24.1 depicts the most general protocol for generating classical communication,
quantum communication, and entanglement with the consumption of a noisy quantum channel
$\mathcal{N}^{A' \to B}$ and the same respective resources. Alice possesses two classical registers (each
labeled by $M$ and of dimension $2^{n\bar{C}}$), a quantum register $A_1$ of dimension $2^{n\bar{Q}}$ entangled with
a reference system $R$, and another quantum register $T_A$ of dimension $2^{n\tilde{E}}$ that contains her
half of the shared entanglement with Bob:

\begin{equation} \omega^{MMRA_1T_AT_B} \equiv \overline{\Phi}^{MM} \otimes \Phi^{RA_1} \otimes \Phi^{T_AT_B}. \tag{24.1} \end{equation}

She passes one of the classical registers and the registers $A_1$ and $T_A$ into a CPTP encoding
map $\mathcal{E}^{MA_1T_A \to A'^nS_ALA_2}$ that outputs a quantum register $S_A$ of dimension $2^{n\bar{E}}$, a quantum
register $A_2$ of dimension $2^{n\tilde{Q}}$, a classical register $L$ of dimension $2^{n\tilde{C}}$, and many quantum
systems $A'^n$ for input to the channel. The register $S_A$ is for creating entanglement with Bob.
The state after the encoding map $\mathcal{E}$ is as follows:

\begin{equation} \omega^{MA'^nS_ALA_2RT_B} \equiv \mathcal{E}^{MA_1T_A \to A'^nS_ALA_2}(\omega^{MMRA_1T_AT_B}). \tag{24.2} \end{equation}

[Figure 24.1: protocol diagram showing the reference system $R$, Alice's encoder, the channel uses, and Bob's decoder.]

Figure 24.1: The most general protocol for generating classical communication, quantum communication,
and entanglement generation with the help of the same respective resources and many uses of a noisy quantum
channel. Alice begins with her classical register $M$, her quantum register $A_1$, and her half of the shared
entanglement in register $T_A$. She encodes according to some CPTP map $\mathcal{E}$ that outputs a quantum register
$S_A$, many registers $A'^n$, a quantum register $A_2$, and a classical register $L$. She inputs $A'^n$ to many uses
of the noisy channel $\mathcal{N}$ and transmits $A_2$ over a noiseless quantum channel and $L$ over a noiseless classical
channel. Bob receives the channel outputs $B^n$, the quantum register $A_2$, and the classical register $L$ and
performs a decoding $\mathcal{D}$ that recovers the quantum information and classical message. The decoding also
generates entanglement with system $S_A$. Many protocols are a special case of the above one. For example,
the protocol is entanglement-assisted communication of classical and quantum information if the registers
$L$, $S_A$, $S_B$, and $A_2$ are null.

She sends the systems $A'^n$ through many uses $\mathcal{N}^{A'^n \to B^n}$ of the noisy channel $\mathcal{N}^{A' \to B}$,
transmits $L$ over a noiseless classical channel, and transmits $A_2$ over a noiseless quantum channel,
producing the following state:

\begin{equation} \omega^{MB^nS_ALA_2RT_B} \equiv \mathcal{N}^{A'^n \to B^n}(\omega^{MA'^nS_ALA_2RT_B}). \tag{24.3} \end{equation}

The above state is a state of the following form:

\begin{equation} \sum_x p_X(x)\, |x\rangle\langle x|^{X} \otimes \mathcal{N}^{A'^n \to B^n}(\rho_x^{AA'^n}), \tag{24.4} \end{equation}

with $A \equiv RT_BA_2S_A$ and $X \equiv ML$. Bob then applies a map $\mathcal{D}^{B^nA_2T_BL \to B_1S_B\hat{M}}$ that outputs
a quantum system $B_1$, a quantum system $S_B$, and a classical register $\hat{M}$. Let $\omega'$ denote the
final state. The following condition holds for a good protocol:

\begin{equation} \left\| \overline{\Phi}^{M\hat{M}} \otimes \Phi^{RB_1} \otimes \Phi^{S_AS_B} - (\omega')^{MB_1S_B\hat{M}S_AR} \right\|_1 \leq \epsilon, \tag{24.5} \end{equation}

implying that Alice and Bob establish maximal classical correlations in $M$ and $\hat{M}$ and
maximal entanglement between $S_A$ and $S_B$. The above condition also implies that the coding
scheme preserves the entanglement with the reference system $R$. The net rate triple for the
protocol is as follows: $(\bar{C} - \tilde{C} - \delta,\ \bar{Q} - \tilde{Q} - \delta,\ \bar{E} - \tilde{E} - \delta)$ for some arbitrarily small $\delta > 0$,
where a barred rate refers to a resource that the protocol generates and a tilded rate to one that it consumes.
The protocol generates a resource if its corresponding rate is positive, and it consumes a
resource if its corresponding rate is negative. We say that a rate triple $(C, Q, E)$ is achievable
if there exists a protocol of the above form for all $\delta, \epsilon > 0$ and sufficiently large $n$.



24.2 The Quantum Dynamic Capacity Theorem 

The dynamic capacity theorem gives bounds on the reliable communication rates of a noisy 
quantum channel when combined with the noiseless resources of classical communication, 
quantum communication, and shared entanglement. The theorem applies regardless of 
whether a protocol consumes the noiseless resources or generates them. 

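Theorem 24.2.1 (Quantum Dynamic Capacity). The dynamic capacity region $\mathcal{C}_{\mathrm{CQE}}(\mathcal{N})$ of a quantum channel $\mathcal{N}$ is equal to the following expression:

\[
\mathcal{C}_{\mathrm{CQE}}(\mathcal{N}) = \overline{\bigcup_{k=1}^{\infty} \frac{1}{k}\, \mathcal{C}^{(1)}_{\mathrm{CQE}}(\mathcal{N}^{\otimes k})}, \tag{24.6}
\]

where the overbar indicates the closure of a set. The "one-shot" region $\mathcal{C}^{(1)}_{\mathrm{CQE}}(\mathcal{N})$ is the union of the "one-shot, one-state" regions $\mathcal{C}^{(1)}_{\mathrm{CQE},\sigma}(\mathcal{N})$:

\[
\mathcal{C}^{(1)}_{\mathrm{CQE}}(\mathcal{N}) \equiv \bigcup_{\sigma} \mathcal{C}^{(1)}_{\mathrm{CQE},\sigma}(\mathcal{N}). \tag{24.7}
\]

The "one-shot, one-state" region $\mathcal{C}^{(1)}_{\mathrm{CQE},\sigma}(\mathcal{N})$ is the set of all rates $C$, $Q$, and $E$, such that

\begin{align}
C + 2Q &\leq I(AX;B)_\sigma, \tag{24.8} \\
Q + E &\leq I(A\rangle BX)_\sigma, \tag{24.9} \\
C + Q + E &\leq I(X;B)_\sigma + I(A\rangle BX)_\sigma. \tag{24.10}
\end{align}

The above entropic quantities are with respect to a classical-quantum state $\sigma^{XAB}$ where

\[
\sigma^{XAB} \equiv \sum_x p_X(x)\, |x\rangle\langle x|^X \otimes \mathcal{N}^{A' \to B}\big(\phi_x^{AA'}\big), \tag{24.11}
\]

and the states $\phi_x^{AA'}$ are pure. It is implicit that one should consider states on $A'^k$ instead of $A'$ when taking the regularization in (24.6).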

The above theorem is a "multi-letter" capacity theorem because of the regularization in (24.6). However, we show in Section 24.5.1 that the regularization is not necessary for the Hadamard class of channels. We prove the above theorem in two parts:

1. The direct coding theorem in Section 24.3 shows that combining the protocol from Corollary 21.5.2 with teleportation, super-dense coding, and entanglement distribution achieves the above region.

2. The converse theorem in Section 24.4 demonstrates that no coding scheme can do better than the regularization in (24.6), in the sense that a scheme with vanishing error should have its rates below the above amounts.

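Exercise 24.2.1 Show that it suffices to evaluate just the following four entropies in order to determine the one-shot, one-state region in Theorem 24.2.1:

\begin{align}
H(A|X)_\sigma &= \sum_x p_X(x)\, H(A)_{\phi_x}, \tag{24.12} \\
H(B)_\sigma &= H\Big(\sum_x p_X(x)\, \mathcal{N}^{A' \to B}\big(\phi_x^{A'}\big)\Big), \tag{24.13} \\
H(B|X)_\sigma &= \sum_x p_X(x)\, H\big(\mathcal{N}^{A' \to B}(\phi_x^{A'})\big), \tag{24.14} \\
H(E|X)_\sigma &= \sum_x p_X(x)\, H\big((\mathcal{N}^c)^{A' \to E}(\phi_x^{A'})\big), \tag{24.15}
\end{align}

where the state $\sigma^{XAB}$ is of the form in (24.11).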
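To make the role of these four entropies concrete, the following is a minimal numerical sketch rather than a general implementation. It assumes a qubit dephasing channel with Kraus operators $\sqrt{1-p}\,I$ and $\sqrt{p}\,Z$ and pure input states of the form $\sqrt{a_x}|00\rangle + \sqrt{1-a_x}|11\rangle$, so that every reduced state is diagonal and each entropy reduces to a binary entropy. The function names and the example ensemble are hypothetical choices made for illustration. The three right-hand sides of (24.8)-(24.10) are then assembled as $H(A|X) + H(B) - H(E|X)$, $H(B|X) - H(E|X)$, and $H(B) - H(E|X)$.

```python
import math

def h2(q):
    """Binary entropy in bits, using the convention 0 log 0 = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)

def env_entropy(a, p):
    """Entropy of the environment output when the A' marginal is diag(a, 1-a).

    Assumed model: Kraus operators sqrt(1-p) I and sqrt(p) Z. The environment
    state is [[1-p, c], [c, p]] with c = sqrt(p(1-p)) (2a - 1), whose
    eigenvalues are (1 +/- d)/2 with d = sqrt((1-2p)^2 + 4 c^2).
    """
    c = math.sqrt(p * (1.0 - p)) * (2.0 * a - 1.0)
    d = math.sqrt((1.0 - 2.0 * p) ** 2 + 4.0 * c * c)
    return h2((1.0 + d) / 2.0)

def four_entropies(ensemble, p):
    """ensemble: list of (p_x, a_x) for |phi_x> = sqrt(a_x)|00> + sqrt(1-a_x)|11>."""
    h_a_x = sum(px * h2(a) for px, a in ensemble)              # H(A|X)
    h_b = h2(sum(px * a for px, a in ensemble))                # H(B); dephasing keeps diagonals
    h_b_x = sum(px * h2(a) for px, a in ensemble)              # H(B|X)
    h_e_x = sum(px * env_entropy(a, p) for px, a in ensemble)  # H(E|X)
    return h_a_x, h_b, h_b_x, h_e_x

def region_bounds(ensemble, p):
    """Right-hand sides of the three inequalities for one state sigma."""
    h_a_x, h_b, h_b_x, h_e_x = four_entropies(ensemble, p)
    return {
        "C+2Q": h_a_x + h_b - h_e_x,  # I(AX;B)
        "Q+E": h_b_x - h_e_x,         # I(A>BX)
        "C+Q+E": h_b - h_e_x,         # I(X;B) + I(A>BX)
    }

if __name__ == "__main__":
    # Illustrative ensemble (an assumption, not taken from the text).
    print(region_bounds([(0.5, 0.25), (0.5, 0.75)], p=0.2))
```

For a single maximally entangled input ($a = 1/2$), the first bound evaluates to $2 - h_2(p)$ and the other two to $1 - h_2(p)$, where $h_2$ is the binary entropy.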
24.2.1 Special Cases of the Quantum Dynamic Capacity Theorem

We first consider five special cases of the above capacity theorem that arise when $Q$ and $E$ both vanish, $C$ and $E$ both vanish, or one of $C$, $Q$, or $E$ vanishes. The first two cases correspond respectively to the classical capacity theorem from Chapter 19 and the quantum capacity theorem from Chapter 23. Each of the other special cases traces out a two-dimensional achievable rate region in the three-dimensional capacity region. The five coding scenarios are as follows:

1. Classical communication (C) when there is no entanglement assistance or quantum 
communication. The achievable rate region lies on the (C, 0, 0) ray extending from the 
origin. 

2. Quantum communication (Q) when there is no entanglement assistance or classical 
communication. The achievable rate region lies on the (0, Q, 0) ray extending from the 
origin. 

3. Entanglement-assisted quantum communication (QE) when there is no classical communication. The achievable rate region lies in the $(0, Q, -E)$ quarter-plane of the three-dimensional region in Theorem 24.2.1.

4. Classically-enhanced quantum communication (CQ) when there is no entanglement assistance. The achievable rate region lies in the $(C, Q, 0)$ quarter-plane of the three-dimensional region in Theorem 24.2.1.

5. Entanglement-assisted classical communication (CE) when there is no quantum communication. The achievable rate region lies in the $(C, 0, -E)$ quarter-plane of the three-dimensional region in Theorem 24.2.1.



Classical Capacity 

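The following theorem gives the one-dimensional capacity region $\mathcal{C}_{\mathrm{C}}(\mathcal{N})$ of a quantum channel $\mathcal{N}$ for classical communication.

Theorem 24.2.2 (Holevo-Schumacher-Westmoreland). The classical capacity region $\mathcal{C}_{\mathrm{C}}(\mathcal{N})$ is given by

\[
\mathcal{C}_{\mathrm{C}}(\mathcal{N}) = \overline{\bigcup_{k=1}^{\infty} \frac{1}{k}\, \mathcal{C}^{(1)}_{\mathrm{C}}(\mathcal{N}^{\otimes k})}. \tag{24.16}
\]

The "one-shot" region $\mathcal{C}^{(1)}_{\mathrm{C}}(\mathcal{N})$ is the union of the regions $\mathcal{C}^{(1)}_{\mathrm{C},\sigma}(\mathcal{N})$, where $\mathcal{C}^{(1)}_{\mathrm{C},\sigma}(\mathcal{N})$ is the set of all $C \geq 0$, such that

\[
C \leq I(X;B)_\sigma + I(A\rangle BX)_\sigma. \tag{24.17}
\]

The entropic quantity is with respect to the state $\sigma^{XABE}$ in (24.11).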
The bound in (24.17) is a special case of the bound in (24.10) with $Q = 0$ and $E = 0$. The above characterization of the classical capacity region may seem slightly different from the characterization in Chapter 19 until we make a few observations. First, we rewrite the coherent information $I(A\rangle BX)_\sigma$ as $H(B|X)_\sigma - H(E|X)_\sigma$. Then $I(X;B)_\sigma + I(A\rangle BX)_\sigma = H(B)_\sigma - H(E|X)_\sigma$. Next, pure states of the form $|\phi_x\rangle^{A'}$ are sufficient to attain the classical capacity of a quantum channel (see Theorem 12.3.2). We briefly recall this argument. An ensemble of the following form realizes the classical capacity of a quantum channel:

\[
\rho^{XA'} \equiv \sum_x p_X(x)\, |x\rangle\langle x|^X \otimes \rho_x^{A'}. \tag{24.18}
\]

This ensemble itself is a restriction of the ensemble in (24.11) to the systems $X$ and $A'$. Each mixed state $\rho_x^{A'}$ admits a spectral decomposition of the form $\rho_x^{A'} = \sum_y p_{Y|X}(y|x)\, \psi_{x,y}^{A'}$ where $\psi_{x,y}^{A'}$ is a pure state. We can define an augmented classical-quantum state $\theta^{XYA'}$ as follows:

\[
\theta^{XYA'} \equiv \sum_{x,y} p_{Y|X}(y|x)\, p_X(x)\, |x\rangle\langle x|^X \otimes |y\rangle\langle y|^Y \otimes \psi_{x,y}^{A'}, \tag{24.19}
\]

so that $\mathrm{Tr}_Y\{\theta^{XYA'}\} = \rho^{XA'}$. Sending the $A'$ system of the states $\rho^{XA'}$ and $\theta^{XYA'}$ through the channel leads to the respective states $\rho^{XB}$ and $\theta^{XYB}$. Then the following equality and inequality hold

\begin{align}
I(X;B)_\rho &= I(X;B)_\theta \tag{24.20} \\
&\leq I(XY;B)_\theta, \tag{24.21}
\end{align}

where the equality holds because $\mathrm{Tr}_Y\{\theta^{XYA'}\} = \rho^{XA'}$ and the inequality follows from quantum data processing. Redefining the classical variable as the joint random variable $X, Y$ reveals that it is sufficient to consider pure state ensembles for the classical capacity. Returning to our main argument, then $H(E|X)_\sigma = H(B|X)_\sigma$ so that $I(X;B)_\sigma + I(A\rangle BX)_\sigma = H(B)_\sigma - H(B|X)_\sigma = I(X;B)_\sigma$ for states of this form. Thus, the expression in (24.17) can never exceed the classical capacity and finds its maximum exactly at the Holevo information.

Quantum Capacity 

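The following theorem gives the one-dimensional quantum capacity region $\mathcal{C}_{\mathrm{Q}}(\mathcal{N})$ of a quantum channel $\mathcal{N}$.

Theorem 24.2.3 (Quantum Capacity). The quantum capacity region $\mathcal{C}_{\mathrm{Q}}(\mathcal{N})$ is given by

\[
\mathcal{C}_{\mathrm{Q}}(\mathcal{N}) = \overline{\bigcup_{k=1}^{\infty} \frac{1}{k}\, \mathcal{C}^{(1)}_{\mathrm{Q}}(\mathcal{N}^{\otimes k})}. \tag{24.22}
\]

The "one-shot" region $\mathcal{C}^{(1)}_{\mathrm{Q}}(\mathcal{N})$ is the union of the regions $\mathcal{C}^{(1)}_{\mathrm{Q},\sigma}(\mathcal{N})$, where $\mathcal{C}^{(1)}_{\mathrm{Q},\sigma}(\mathcal{N})$ is the set of all $Q \geq 0$, such that

\[
Q \leq I(A\rangle BX)_\sigma. \tag{24.23}
\]

The entropic quantity is with respect to the state $\sigma^{XABE}$ in (24.11) with the restriction that the density $p_X(x)$ is degenerate.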
The bound in (24.23) is a special case of the bound in (24.9) with $E = 0$. The other bounds in Theorem 24.2.1 are looser than the bound in (24.9) when $C, E = 0$.



Entanglement-Assisted Quantum Capacity

The following theorem gives the two-dimensional entanglement-assisted quantum capacity region $\mathcal{C}_{\mathrm{QE}}(\mathcal{N})$ of a quantum channel $\mathcal{N}$.

Theorem 24.2.4 (Devetak-Harrow-Winter). The entanglement-assisted quantum capacity region $\mathcal{C}_{\mathrm{QE}}(\mathcal{N})$ is given by

\[
\mathcal{C}_{\mathrm{QE}}(\mathcal{N}) = \overline{\bigcup_{k=1}^{\infty} \frac{1}{k}\, \mathcal{C}^{(1)}_{\mathrm{QE}}(\mathcal{N}^{\otimes k})}. \tag{24.24}
\]

The "one-shot" region $\mathcal{C}^{(1)}_{\mathrm{QE}}(\mathcal{N})$ is the union of the regions $\mathcal{C}^{(1)}_{\mathrm{QE},\sigma}(\mathcal{N})$, where $\mathcal{C}^{(1)}_{\mathrm{QE},\sigma}(\mathcal{N})$ is the set of all $Q, E \geq 0$, such that

\begin{align}
2Q &\leq I(AX;B)_\sigma, \tag{24.25} \\
Q &\leq I(A\rangle BX)_\sigma + |E|. \tag{24.26}
\end{align}

The entropic quantities are with respect to the state $\sigma^{XABE}$ in (24.11) with the restriction that the density $p_X(x)$ is degenerate.

The bounds in (24.25) and (24.26) are a special case of the respective bounds in (24.8) and (24.9) with $C = 0$. The other bounds in Theorem 24.2.1 are looser than the bounds in (24.8) and (24.9) when $C = 0$. Observe that the region is a union of general pentagons (see the QE-plane in Figure 24.2 for an example of one of these general pentagons in the union).



Classically-Enhanced Quantum Capacity 

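The following theorem gives the two-dimensional capacity region $\mathcal{C}_{\mathrm{CQ}}(\mathcal{N})$ for classically-enhanced quantum communication over a quantum channel $\mathcal{N}$.

Theorem 24.2.5 (Devetak-Shor). The classically-enhanced quantum capacity region $\mathcal{C}_{\mathrm{CQ}}(\mathcal{N})$ is given by

\[
\mathcal{C}_{\mathrm{CQ}}(\mathcal{N}) = \overline{\bigcup_{k=1}^{\infty} \frac{1}{k}\, \mathcal{C}^{(1)}_{\mathrm{CQ}}(\mathcal{N}^{\otimes k})}. \tag{24.27}
\]

The "one-shot" region $\mathcal{C}^{(1)}_{\mathrm{CQ}}(\mathcal{N})$ is the union of the regions $\mathcal{C}^{(1)}_{\mathrm{CQ},\sigma}(\mathcal{N})$, where $\mathcal{C}^{(1)}_{\mathrm{CQ},\sigma}(\mathcal{N})$ is the set of all $C, Q \geq 0$, such that

\begin{align}
C + Q &\leq I(X;B)_\sigma + I(A\rangle BX)_\sigma, \tag{24.28} \\
Q &\leq I(A\rangle BX)_\sigma. \tag{24.29}
\end{align}

The entropic quantities are with respect to the state $\sigma^{XABE}$ in (24.11).

The bounds in (24.28) and (24.29) are special cases of the respective bounds in (24.10) and (24.9) with $E = 0$. Observe that the region is a union of trapezoids (see the CQ-plane in Figure 24.2 for an example of one of these trapezoids in the union).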
Entanglement-Assisted Classical Capacity with Limited Entanglement

Theorem 24.2.6 (Shor). The entanglement-assisted classical capacity region $\mathcal{C}_{\mathrm{CE}}(\mathcal{N})$ of a quantum channel $\mathcal{N}$ is

\[
\mathcal{C}_{\mathrm{CE}}(\mathcal{N}) = \overline{\bigcup_{k=1}^{\infty} \frac{1}{k}\, \mathcal{C}^{(1)}_{\mathrm{CE}}(\mathcal{N}^{\otimes k})}. \tag{24.30}
\]

The "one-shot" region $\mathcal{C}^{(1)}_{\mathrm{CE}}(\mathcal{N})$ is the union of the regions $\mathcal{C}^{(1)}_{\mathrm{CE},\sigma}(\mathcal{N})$, where $\mathcal{C}^{(1)}_{\mathrm{CE},\sigma}(\mathcal{N})$ is the set of all $C, E \geq 0$, such that

\begin{align}
C &\leq I(AX;B)_\sigma, \tag{24.31} \\
C &\leq I(X;B)_\sigma + I(A\rangle BX)_\sigma + |E|, \tag{24.32}
\end{align}

where the entropic quantities are with respect to the state $\sigma^{XABE}$ in (24.11).

The bounds in (24.31) and (24.32) are a special case of the respective bounds in (24.8) and (24.10) with $Q = 0$. Observe that the region is a union of general polyhedra (see the CE-plane in Figure 24.2 for an example of one of these general polyhedra in the union).



24.3 The Direct Coding Theorem 



The unit resource achievable region is what Alice and Bob can achieve with the protocols entanglement distribution, teleportation, and super-dense coding (see Chapter 8). It is the cone of the rate triples corresponding to these protocols:

\[
\{\alpha(0, -1, 1) + \beta(2, -1, -1) + \gamma(-2, 1, -1) : \alpha, \beta, \gamma \geq 0\}. \tag{24.33}
\]

We can also write any rate triple $(C, Q, E)$ in the unit resource capacity region with a matrix equation:

\[
\begin{bmatrix} C \\ Q \\ E \end{bmatrix}
=
\begin{bmatrix} 0 & 2 & -2 \\ -1 & -1 & 1 \\ 1 & -1 & -1 \end{bmatrix}
\begin{bmatrix} \alpha \\ \beta \\ \gamma \end{bmatrix}. \tag{24.34}
\]

The inverse of the above matrix is as follows:

\[
\begin{bmatrix} -\tfrac{1}{2} & -1 & 0 \\ 0 & -\tfrac{1}{2} & -\tfrac{1}{2} \\ -\tfrac{1}{2} & -\tfrac{1}{2} & -\tfrac{1}{2} \end{bmatrix}, \tag{24.35}
\]

and gives the following set of inequalities for the unit resource achievable region:

\begin{align}
C + 2Q &\leq 0, \tag{24.36} \\
Q + E &\leq 0, \tag{24.37} \\
C + Q + E &\leq 0, \tag{24.38}
\end{align}

by inverting the matrix equation in (24.34) and applying the constraints $\alpha, \beta, \gamma \geq 0$.
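The inversion above can be checked with exact rational arithmetic. The following is a small sketch: the columns of `M` are the rate triples of entanglement distribution $(0,-1,1)$, super-dense coding $(2,-1,-1)$, and teleportation $(-2,1,-1)$, and the helper names are hypothetical. A rate triple lies in the unit resource achievable region exactly when the recovered coefficients $\alpha, \beta, \gamma$ are all nonnegative.

```python
from fractions import Fraction as F

# Columns: entanglement distribution, super-dense coding, teleportation.
M = [[F(0), F(2), F(-2)],
     [F(-1), F(-1), F(1)],
     [F(1), F(-1), F(-1)]]

# Claimed inverse of M.
M_INV = [[F(-1, 2), F(-1), F(0)],
         [F(0), F(-1, 2), F(-1, 2)],
         [F(-1, 2), F(-1, 2), F(-1, 2)]]

def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(a, v):
    """Product of a 3x3 matrix with a length-3 vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def unit_coefficients(c, q, e):
    """Recover (alpha, beta, gamma) from a rate triple (C, Q, E)."""
    return matvec(M_INV, [F(c), F(q), F(e)])

def in_unit_region(c, q, e):
    """True when the triple is a nonnegative combination of the unit protocols."""
    return all(x >= 0 for x in unit_coefficients(c, q, e))

if __name__ == "__main__":
    identity = [[F(int(i == j)) for j in range(3)] for i in range(3)]
    print(matmul(M, M_INV) == identity)
    print(unit_coefficients(-2, 1, -1))  # teleportation alone
    print(in_unit_region(1, 0, 0))       # free classical communication
```

Since $\alpha = -\tfrac{1}{2}(C + 2Q)$, $\beta = -\tfrac{1}{2}(Q + E)$, and $\gamma = -\tfrac{1}{2}(C + Q + E)$, nonnegativity of the coefficients is the same as the three inequalities $C + 2Q \leq 0$, $Q + E \leq 0$, and $C + Q + E \leq 0$.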



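Now, let us include the protocol from Corollary 21.5.2 for entanglement-assisted communication of classical and quantum information. Corollary 21.5.2 states that we can achieve the following rate triple by channel coding over a noisy quantum channel $\mathcal{N}^{A' \to B}$:

\[
\Big( I(X;B)_\sigma,\ \tfrac{1}{2} I(A;B|X)_\sigma,\ -\tfrac{1}{2} I(A;E|X)_\sigma \Big), \tag{24.39}
\]

for any state $\sigma^{XABE}$ of the form:

\[
\sigma^{XABE} \equiv \sum_x p_X(x)\, |x\rangle\langle x|^X \otimes U_{\mathcal{N}}^{A' \to BE}\big(\phi_x^{AA'}\big), \tag{24.40}
\]

where $U_{\mathcal{N}}^{A' \to BE}$ is an isometric extension of the quantum channel $\mathcal{N}^{A' \to B}$. Specifically, we showed in Corollary 21.5.2 that one can achieve the above rates with vanishing error in the limit of large blocklength. Thus the achievable rate region is the following translation of the unit resource achievable region in (24.34):

\[
\begin{bmatrix} C \\ Q \\ E \end{bmatrix}
=
\begin{bmatrix} 0 & 2 & -2 \\ -1 & -1 & 1 \\ 1 & -1 & -1 \end{bmatrix}
\begin{bmatrix} \alpha \\ \beta \\ \gamma \end{bmatrix}
+
\begin{bmatrix} I(X;B)_\sigma \\ \tfrac{1}{2} I(A;B|X)_\sigma \\ -\tfrac{1}{2} I(A;E|X)_\sigma \end{bmatrix}. \tag{24.41}
\]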

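We can now determine bounds on an achievable rate region that employs the above coding strategy. We apply the inverse of the matrix in (24.34) to the LHS and RHS, giving

\[
\begin{bmatrix} -\tfrac{1}{2} & -1 & 0 \\ 0 & -\tfrac{1}{2} & -\tfrac{1}{2} \\ -\tfrac{1}{2} & -\tfrac{1}{2} & -\tfrac{1}{2} \end{bmatrix}
\begin{bmatrix} C \\ Q \\ E \end{bmatrix}
-
\begin{bmatrix} -\tfrac{1}{2} & -1 & 0 \\ 0 & -\tfrac{1}{2} & -\tfrac{1}{2} \\ -\tfrac{1}{2} & -\tfrac{1}{2} & -\tfrac{1}{2} \end{bmatrix}
\begin{bmatrix} I(X;B)_\sigma \\ \tfrac{1}{2} I(A;B|X)_\sigma \\ -\tfrac{1}{2} I(A;E|X)_\sigma \end{bmatrix}
=
\begin{bmatrix} \alpha \\ \beta \\ \gamma \end{bmatrix}. \tag{24.42}
\]

Then using the following identities

\begin{align}
I(X;B)_\sigma + I(A;B|X)_\sigma &= I(AX;B)_\sigma, \tag{24.43} \\
\tfrac{1}{2} I(A;B|X)_\sigma - \tfrac{1}{2} I(A;E|X)_\sigma &= I(A\rangle BX)_\sigma, \tag{24.44}
\end{align}

and the constraints $\alpha, \beta, \gamma \geq 0$, we obtain the inequalities in (24.8)-(24.10), corresponding exactly to the one-shot, one-state region in Theorem 24.2.1. Taking the union over all possible states $\sigma$ in (24.11) and taking the regularization gives the full dynamic achievable rate region.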
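As a worked check of this step, the three rows of the inverted matrix equation can be written out one at a time. The block below is a sketch of that row-by-row algebra; it uses only the inverse matrix of the unit resource cone, the translated rate triple from Corollary 21.5.2, and the two identities just stated.

```latex
\begin{align*}
\text{Row 1:}\quad
  & -\tfrac{1}{2}\big(C - I(X;B)_\sigma\big)
    - \big(Q - \tfrac{1}{2} I(A;B|X)_\sigma\big) = \alpha \geq 0 \\
  & \Longrightarrow\ C + 2Q \leq I(X;B)_\sigma + I(A;B|X)_\sigma = I(AX;B)_\sigma, \\[4pt]
\text{Row 2:}\quad
  & -\tfrac{1}{2}\big(Q - \tfrac{1}{2} I(A;B|X)_\sigma\big)
    - \tfrac{1}{2}\big(E + \tfrac{1}{2} I(A;E|X)_\sigma\big) = \beta \geq 0 \\
  & \Longrightarrow\ Q + E \leq \tfrac{1}{2} I(A;B|X)_\sigma - \tfrac{1}{2} I(A;E|X)_\sigma
    = I(A\rangle BX)_\sigma, \\[4pt]
\text{Row 3:}\quad
  & -\tfrac{1}{2}\big(C - I(X;B)_\sigma\big)
    - \tfrac{1}{2}\big(Q - \tfrac{1}{2} I(A;B|X)_\sigma\big)
    - \tfrac{1}{2}\big(E + \tfrac{1}{2} I(A;E|X)_\sigma\big) = \gamma \geq 0 \\
  & \Longrightarrow\ C + Q + E \leq I(X;B)_\sigma + I(A\rangle BX)_\sigma.
\end{align*}
```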



Figure 24.2 illustrates an example of the general polyhedron specified by (24.8)-(24.10), where the channel is the qubit dephasing channel $\rho \to (1 - p)\rho + pZ\rho Z$ with dephasing parameter $p = 0.2$, and the input state is

\[
\sigma^{XAA'} \equiv \frac{1}{2}\Big( |0\rangle\langle 0|^X \otimes \phi_0^{AA'} + |1\rangle\langle 1|^X \otimes \phi_1^{AA'} \Big), \tag{24.45}
\]

where

\begin{align}
|\phi_0\rangle^{AA'} &\equiv \sqrt{1/4}\,|00\rangle^{AA'} + \sqrt{3/4}\,|11\rangle^{AA'}, \tag{24.46} \\
|\phi_1\rangle^{AA'} &\equiv \sqrt{3/4}\,|00\rangle^{AA'} + \sqrt{1/4}\,|11\rangle^{AA'}. \tag{24.47}
\end{align}

The state $\sigma^{XABE}$ resulting from the channel is $U_{\mathcal{N}}^{A' \to BE}(\sigma^{XAA'})$ where $U_{\mathcal{N}}$ is an isometric extension of the qubit dephasing channel. The figure caption provides a detailed explanation of the one-shot, one-state region $\mathcal{C}^{(1)}_{\mathrm{CQE},\sigma}$ (note that Figure 24.2 displays the one-shot, one-state region and does not display the full capacity region).



24.4 The Converse Theorem 



We provide a catalytic, information theoretic converse proof of the dynamic capacity region, showing that (24.6) gives a multi-letter characterization of it. The catalytic approach means that we are considering the most general protocol that consumes and generates classical communication, quantum communication, and entanglement in addition to the uses of the noisy quantum channel. This approach has the advantage that we can prove the converse theorem in "one fell swoop." We employ the Alicki-Fannes' inequality, the chain rule for quantum mutual information, elementary properties of quantum entropy, and the quantum data processing inequality to prove the converse.

We show that the bounds in (24.8)-(24.10) hold for common randomness generation instead of classical communication because a capacity for generating common randomness can only be better than that for generating classical communication (classical communication can generate common randomness). We also consider a protocol that preserves entanglement with a reference system instead of one that generates quantum communication. We prove that the converse theorem holds for a state of the following form

\[
\sigma^{XAB} \equiv \sum_x p(x)\, |x\rangle\langle x|^X \otimes \mathcal{N}^{A' \to B}\big(\rho_x^{AA'}\big), \tag{24.48}
\]

where the states $\rho_x^{AA'}$ are mixed, rather than proving it for a state of the form in (24.11). Then we show in Section 24.4.1 that it is not necessary to consider an ensemble of mixed states, i.e., we can do just as well with an ensemble of pure states, giving the statement of Theorem 24.2.1.



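We first prove the bound in (24.8). Consider the following chain of inequalities:

\begin{align}
n(\bar{C} + 2\bar{Q}) &= I(M;\hat{M})_{\overline{\Phi}} + I(R;B_1)_{\Phi} \tag{24.49} \\
&\leq I(M;\hat{M})_{\omega'} + I(R;B_1)_{\omega'} + n\delta' \tag{24.50} \\
&\leq I(M; B^n A_2 L T_B)_{\omega} + I(R; B^n A_2 L T_B)_{\omega} \tag{24.51} \\
&\leq I(M; B^n A_2 L T_B)_{\omega} + I(R; B^n A_2 L T_B M)_{\omega} \tag{24.52} \\
&= I(M; B^n A_2 L T_B)_{\omega} + I(R; B^n A_2 L T_B | M)_{\omega} + I(R;M)_{\omega}. \tag{24.53}
\end{align}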
Figure 24.2: An example of the one-shot, one-state achievable region $\mathcal{C}^{(1)}_{\mathrm{CQE},\sigma}(\mathcal{N})$ corresponding to a state $\sigma^{XABE}$ that arises from a qubit dephasing channel with dephasing parameter $p = 0.2$. The figure depicts the octant corresponding to the consumption of entanglement and the generation of classical and quantum communication. The state input to the channel $\mathcal{N}$ is $\sigma^{XAA'}$, defined in (24.45). The plot features seven achievable corner points of the one-shot, one-state region. We can achieve the convex hull of these seven points by time-sharing any two different coding strategies. We can also achieve any point above an achievable point by consuming more entanglement than necessary. The seven achievable points correspond to entanglement-assisted quantum communication (EAQ), the protocol from Corollary 21.5.3 for classically-enhanced quantum communication (CEQ), the protocol from Theorem 21.5.1 for entanglement-assisted classical communication with limited entanglement (EAC), quantum communication (LSD), combining CEF with entanglement distribution and super-dense coding (CEF-SD-ED), the protocol from Corollary 21.5.2 for entanglement-assisted communication of classical and quantum information (CEF), and combining CEF with teleportation (CEF-TP). Observe that we can obtain EAC by combining CEF with super-dense coding, so that the points CEQ, CEF, EAC, and CEF-SD-ED all lie in plane III. Observe that we can obtain CEQ from CEF by entanglement distribution and we can obtain LSD from EAQ and EAQ from CEF-TP, both by entanglement distribution. Thus, the points CEF, CEQ, LSD, EAQ, and CEF-TP all lie in plane II. Finally, observe that we can obtain all corner points by combining CEF with the unit protocols of teleportation, super-dense coding, and entanglement distribution. The bounds in (24.8)-(24.10) uniquely specify the respective planes I-III. We obtain the full achievable region by taking the union over all states $\sigma$ of the one-shot, one-state regions $\mathcal{C}^{(1)}_{\mathrm{CQE},\sigma}(\mathcal{N})$ and taking the regularization, as outlined in Theorem 24.2.1. The above region is a translation of the unit resource capacity region from Chapter 8 to the protocol for entanglement-assisted communication of classical and quantum information.
The first equality holds by evaluating the quantum mutual informations on the respective states $\overline{\Phi}^{M\hat{M}}$ and $\Phi^{RB_1}$. The first inequality follows from the condition in (24.5) and an application of the Alicki-Fannes' inequality where $\delta'$ vanishes as $\epsilon \to 0$. We suppress this term in the rest of the inequalities for convenience. The second inequality follows from quantum data processing, and the third follows from another application of quantum data processing. The second equality follows by applying the mutual information chain rule. We continue below:

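\begin{align}
&= I(M; B^n A_2 L T_B)_{\omega} + I(R; B^n A_2 L T_B | M)_{\omega} \tag{24.54} \\
&= I(M; B^n A_2 L T_B)_{\omega} + I(R A_2 L T_B; B^n | M)_{\omega} \nonumber \\
&\qquad + I(R; A_2 L T_B | M)_{\omega} - I(B^n; A_2 L T_B | M)_{\omega} \tag{24.55} \\
&= I(M; B^n)_{\omega} + I(M; A_2 L T_B | B^n)_{\omega} + I(R A_2 L T_B; B^n | M)_{\omega} \nonumber \\
&\qquad + I(R; A_2 L T_B | M)_{\omega} - I(B^n; A_2 L T_B | M)_{\omega} \tag{24.56} \\
&= I(R A_2 L M T_B; B^n)_{\omega} + I(M; A_2 L T_B | B^n)_{\omega} \nonumber \\
&\qquad + I(R; A_2 L T_B | M)_{\omega} - I(B^n; A_2 L T_B | M)_{\omega} \tag{24.57} \\
&\leq I(R A_2 T_B S_A L M; B^n)_{\omega} + I(M; A_2 L T_B | B^n)_{\omega} \nonumber \\
&\qquad + I(R; A_2 L T_B | M)_{\omega} - I(B^n; A_2 L T_B | M)_{\omega} \tag{24.58} \\
&= I(AX; B^n)_{\omega} + I(M; A_2 L T_B | B^n)_{\omega} \nonumber \\
&\qquad + I(R; A_2 L T_B | M)_{\omega} - I(B^n; A_2 L T_B | M)_{\omega}. \tag{24.59}
\end{align}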

The first equality follows because $I(R;M)_{\omega} = 0$ for this protocol. The second equality follows from applying the chain rule for quantum mutual information to the term $I(R; B^n A_2 L T_B | M)_{\omega}$, and the third is another application of the chain rule to the term $I(M; B^n A_2 L T_B)_{\omega}$. The fourth equality follows by combining $I(M; B^n)_{\omega}$ and $I(R A_2 L T_B; B^n | M)_{\omega}$ with the chain rule. The inequality follows from an application of quantum data processing. The final equality follows from the definitions $A \equiv R T_B A_2 S_A$ and $X \equiv ML$. We now focus on the term $I(M; A_2 L T_B | B^n)_{\omega} + I(R; A_2 L T_B | M)_{\omega} - I(B^n; A_2 L T_B | M)_{\omega}$ and show that it is less than $n(\tilde{C} + 2\tilde{Q})$:

\begin{align}
& I(M; A_2 L T_B | B^n)_{\omega} + I(R; A_2 L T_B | M)_{\omega} - I(B^n; A_2 L T_B | M)_{\omega} \nonumber \\
&= I(M; A_2 L T_B B^n)_{\omega} + I(R; A_2 L T_B M)_{\omega} - I(B^n; A_2 L T_B M)_{\omega} - I(R; M)_{\omega} \tag{24.60} \\
&= I(M; A_2 L T_B B^n)_{\omega} + I(R; A_2 L T_B M)_{\omega} - I(B^n; A_2 L T_B M)_{\omega} \tag{24.61} \\
&= H(A_2 L T_B B^n)_{\omega} + H(R)_{\omega} - H(R A_2 L T_B | M)_{\omega} - H(B^n)_{\omega} \tag{24.62} \\
&= H(A_2 L T_B B^n)_{\omega} - H(A_2 L T_B | M R)_{\omega} - H(B^n)_{\omega} \tag{24.63} \\
&= H(A_2 L T_B | B^n)_{\omega} - H(A_2 L T_B | M R)_{\omega}. \tag{24.64}
\end{align}

The first equality follows by applying the chain rule for quantum mutual information. The second equality follows because $I(R;M)_{\omega} = 0$ for this protocol. The third equality follows by expanding the quantum mutual informations. The next two equalities follow from straightforward entropic manipulations and that $H(R)_{\omega} = H(R|M)_{\omega}$ for this protocol. We continue below:

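\begin{align}
&= H(A_2 L | B^n)_{\omega} + H(T_B | B^n A_2 L)_{\omega} - H(T_B | M R)_{\omega} - H(A_2 L | T_B M R)_{\omega} \tag{24.65} \\
&= H(A_2 L | B^n)_{\omega} + H(T_B | B^n A_2 L)_{\omega} - H(T_B)_{\omega} - H(A_2 L | T_B M R)_{\omega} \tag{24.66} \\
&= H(A_2 L | B^n)_{\omega} - I(T_B; B^n A_2 L)_{\omega} - H(A_2 L | T_B M R)_{\omega} \tag{24.67} \\
&\leq H(A_2 L)_{\omega} - H(A_2 L | T_B M R)_{\omega} \tag{24.68} \\
&= I(A_2 L; T_B M R)_{\omega} \tag{24.69} \\
&= I(L; T_B M R)_{\omega} + I(A_2; T_B M R | L)_{\omega} \tag{24.70} \\
&\leq n(\tilde{C} + 2\tilde{Q}). \tag{24.71}
\end{align}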



The first two equalities follow from the chain rule for entropy and the second exploits that $H(T_B | M R)_{\omega} = H(T_B)_{\omega}$ for this protocol. The third equality follows from the definition of quantum mutual information. The inequality follows from subadditivity of entropy and that $I(T_B; B^n A_2 L)_{\omega} \geq 0$. The fourth equality follows from the definition of quantum mutual information and the next equality follows from the chain rule. The final inequality follows because the quantum mutual information $I(L; T_B M R)_{\omega}$ can never be larger than the logarithm of the dimension of the classical register $L$ and because the quantum mutual information $I(A_2; T_B M R | L)_{\omega}$ can never be larger than twice the logarithm of the dimension of the quantum register $A_2$. Thus the following inequality applies

\[
n(\bar{C} + 2\bar{Q}) \leq I(AX; B^n)_{\omega} + n(\tilde{C} + 2\tilde{Q}) + n\delta', \tag{24.72}
\]

demonstrating that (24.8) holds for the net rates.



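We now prove the second bound in (24.9). Consider the following chain of inequalities:

\begin{align}
n(\bar{Q} + \bar{E}) &= I(R\rangle B_1)_{\Phi} + I(S_A\rangle S_B)_{\Phi} \tag{24.73} \\
&= I(R S_A\rangle B_1 S_B)_{\Phi \otimes \Phi} \tag{24.74} \\
&\leq I(R S_A\rangle B_1 S_B)_{\omega'} + n\delta' \tag{24.75} \\
&\leq I(R S_A\rangle B_1 S_B M)_{\omega'} \tag{24.76} \\
&\leq I(R S_A\rangle B^n A_2 T_B L M)_{\omega} \tag{24.77} \\
&= H(B^n A_2 T_B | L M)_{\omega} - H(R S_A B^n A_2 T_B | L M)_{\omega} \tag{24.78} \\
&\leq H(B^n | L M)_{\omega} + H(A_2 | L M)_{\omega} + H(T_B | L M)_{\omega} \nonumber \\
&\qquad - H(R S_A B^n A_2 T_B | L M)_{\omega} \tag{24.79} \\
&\leq I(R S_A A_2 T_B\rangle B^n L M)_{\omega} + n(\tilde{Q} + \tilde{E}) \tag{24.80} \\
&= I(A\rangle B^n X)_{\omega} + n(\tilde{Q} + \tilde{E}). \tag{24.81}
\end{align}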

The first equality follows by evaluating the coherent informations of the respective states $\Phi^{RB_1}$ and $\Phi^{S_A S_B}$. The second equality follows because $\Phi^{RB_1} \otimes \Phi^{S_A S_B}$ is a product state. The first inequality follows from the condition in (24.5) and an application of the Alicki-Fannes' inequality with $\delta'$ vanishing when $\epsilon \to 0$. We suppress the term $n\delta'$ in the following lines. The next two inequalities follow from quantum data processing. The third equality follows from the definition of coherent information. The fourth inequality follows from subadditivity of entropy. The fifth inequality follows from the definition of coherent information and the fact that the entropy can never be larger than the logarithm of the dimension of the corresponding system. The final equality follows from the definitions $A \equiv R T_B A_2 S_A$ and $X \equiv ML$. Thus the following inequality applies

\[
n(\bar{Q} + \bar{E}) \leq I(A\rangle B^n X)_{\omega} + n(\tilde{Q} + \tilde{E}), \tag{24.82}
\]

demonstrating that (24.9) holds for the net rates.



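We prove the last bound in (24.10). Consider the following chain of inequalities:

\begin{align}
n(\bar{C} + \bar{Q} + \bar{E}) &= I(M;\hat{M})_{\overline{\Phi}} + I(R S_A\rangle B_1 S_B)_{\Phi \otimes \Phi} \tag{24.83} \\
&\leq I(M;\hat{M})_{\omega'} + I(R S_A\rangle B_1 S_B)_{\omega'} + n\delta' \tag{24.84} \\
&\leq I(M; B^n A_2 T_B L)_{\omega} + I(R S_A\rangle B^n A_2 T_B L M)_{\omega} \tag{24.85} \\
&= I(ML; B^n A_2 T_B)_{\omega} + I(M; L)_{\omega} - I(A_2 B^n T_B; L)_{\omega} \nonumber \\
&\qquad + H(B^n | L M)_{\omega} + H(A_2 T_B | B^n L M)_{\omega} \nonumber \\
&\qquad - H(R S_A A_2 T_B B^n | L M)_{\omega} \tag{24.86} \\
&= I(ML; B^n)_{\omega} + I(ML; A_2 T_B | B^n)_{\omega} \nonumber \\
&\qquad + I(M; L)_{\omega} - I(A_2 B^n T_B; L)_{\omega} \nonumber \\
&\qquad + H(A_2 T_B | B^n L M)_{\omega} + I(R S_A A_2 T_B\rangle B^n L M)_{\omega}. \tag{24.87}
\end{align}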

The first equality follows from evaluating the mutual information of the state $\overline{\Phi}^{M\hat{M}}$ and the coherent information of the product state $\Phi^{RB_1} \otimes \Phi^{S_A S_B}$. The first inequality follows from the condition in (24.5) and an application of the Alicki-Fannes' inequality with $\delta'$ vanishing when $\epsilon \to 0$. We suppress the term $n\delta'$ in the following lines. The second inequality follows from quantum data processing. The second equality follows from applying the chain rule for quantum mutual information to $I(M; B^n A_2 T_B L)_{\omega}$ and by expanding the coherent information $I(R S_A\rangle B^n A_2 T_B L M)_{\omega}$. The third equality follows from applying the chain rule for quantum mutual information to $I(ML; B^n A_2 T_B)_{\omega}$ and from the definition of coherent information. We continue below:



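\begin{align}
&= I(ML; B^n)_{\omega} + I(R S_A A_2 T_B\rangle B^n L M)_{\omega} \nonumber \\
&\qquad + I(ML; A_2 T_B | B^n)_{\omega} + I(M; L)_{\omega} \nonumber \\
&\qquad - I(A_2 B^n T_B; L)_{\omega} + H(A_2 T_B | B^n L M)_{\omega} \tag{24.88} \\
&= I(ML; B^n)_{\omega} + I(R S_A A_2 T_B\rangle B^n L M)_{\omega} \nonumber \\
&\qquad + H(A_2 T_B | B^n)_{\omega} + I(M; L)_{\omega} - I(A_2 B^n T_B; L)_{\omega} \tag{24.89} \\
&\leq I(ML; B^n)_{\omega} + I(R S_A A_2 T_B\rangle B^n L M)_{\omega} + n(\tilde{C} + \tilde{Q} + \tilde{E}) \tag{24.90} \\
&= I(X; B^n)_{\omega} + I(A\rangle B^n X)_{\omega} + n(\tilde{C} + \tilde{Q} + \tilde{E}). \tag{24.91}
\end{align}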



The first equality follows by rearranging terms. The second equality follows by canceling terms. The inequality follows from the fact that the entropy $H(A_2 T_B | B^n)_{\omega}$ can never be larger than the logarithm of the dimension of the systems $A_2 T_B$, that the mutual information $I(M; L)_{\omega}$ can never be larger than the logarithm of the dimension of the classical register $L$, and because $I(A_2 B^n T_B; L)_{\omega} \geq 0$. The last equality follows from the definitions $A \equiv R T_B A_2 S_A$ and $X \equiv ML$. Thus the following inequality holds

\[
n(\bar{C} + \bar{Q} + \bar{E}) \leq I(X; B^n)_{\omega} + I(A\rangle B^n X)_{\omega} + n(\tilde{C} + \tilde{Q} + \tilde{E}) + n\delta', \tag{24.92}
\]



demonstrating that the inequality in (24.10) applies to the net rates. This concludes the 
catalytic proof of the converse theorem. 



24.4.1 Pure state ensembles are sufficient 

We prove that it is sufficient to consider an ensemble of pure states as in the statement of Theorem 24.2.1 rather than an ensemble of mixed states as in (24.48) in the proof of our converse theorem. We first determine a spectral decomposition of the mixed state ensemble, model the index of the pure states in the decomposition as a classical variable $Y$, and then place this classical variable $Y$ in a classical register. It follows that the communication rates can only improve, and it is sufficient to consider an ensemble of pure states.

Consider that each mixed state in the ensemble in (24.48) admits a spectral decomposition of the following form:

\[
\rho_x^{AA'} = \sum_y p(y|x)\, \psi_{x,y}^{AA'}. \tag{24.93}
\]

We can thus represent the ensemble as follows:

\[
\rho^{XAB} \equiv \sum_{x,y} p(x)\, p(y|x)\, |x\rangle\langle x|^X \otimes \mathcal{N}^{A' \to B}\big(\psi_{x,y}^{AA'}\big). \tag{24.94}
\]

The inequalities in (24.8)-(24.10) for the dynamic capacity region involve the mutual information $I(AX;B)_\rho$, the Holevo information $I(X;B)_\rho$, and the coherent information $I(A\rangle BX)_\rho$. As we show below, each of these entropic quantities can only improve if we make the variable $y$ be part of the classical variable. This improvement then implies that it is only necessary to consider pure states in the dynamic capacity theorem.
Let $\theta^{XYAB}$ denote an augmented state of the following form:

\[
\theta^{XYAB} \equiv \sum_{x,y} p(x)\, p(y|x)\, |x\rangle\langle x|^X \otimes |y\rangle\langle y|^Y \otimes \mathcal{N}^{A' \to B}\big(\psi_{x,y}^{AA'}\big). \tag{24.95}
\]

This state is actually a state of the form in (24.11) if we subsume the classical variables $X$ and $Y$ into one classical variable. The following three inequalities each follow from an application of the quantum data processing inequality:

\begin{align}
I(X;B)_\rho &= I(X;B)_\theta \leq I(XY;B)_\theta, \tag{24.96} \\
I(AX;B)_\rho &= I(AX;B)_\theta \leq I(AXY;B)_\theta, \tag{24.97} \\
I(A\rangle BX)_\rho &= I(A\rangle BX)_\theta \leq I(A\rangle BXY)_\theta. \tag{24.98}
\end{align}

Each of these inequalities proves the desired result for the respective Holevo information, mutual information, and coherent information, and it suffices to consider an ensemble of pure states in Theorem 24.2.1.
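The Holevo-information inequality can be illustrated numerically on a toy case. The sketch below is an assumption-laden example, not part of the proof: it takes a qubit depolarizing channel $\rho \to (1-p)\rho + p\,I/2$ (chosen only because the dephasing channel leaves diagonal inputs unchanged) and diagonal mixed inputs $\mathrm{diag}(a_x, 1-a_x)$, whose spectral decomposition uses the pure states $|0\rangle$ and $|1\rangle$. The function names are hypothetical.

```python
import math

def h2(q):
    """Binary entropy in bits, using the convention 0 log 0 = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)

def output_weight(a, p):
    """Population of |0> at the output of the depolarizing channel for diag(a, 1-a)."""
    return (1.0 - p) * a + p / 2.0

def holevo_mixed(ensemble, p):
    """I(X;B) for the mixed-state ensemble {(p_x, diag(a_x, 1-a_x))}."""
    mean = sum(px * output_weight(a, p) for px, a in ensemble)
    return h2(mean) - sum(px * h2(output_weight(a, p)) for px, a in ensemble)

def holevo_pure(ensemble, p):
    """I(XY;B) after splitting each mixed state into |0> and |1> labeled by y."""
    mean = sum(px * output_weight(a, p) for px, a in ensemble)
    # Each pure input |0> or |1> yields an output with spectrum (1 - p/2, p/2).
    return h2(mean) - h2(p / 2.0)

if __name__ == "__main__":
    ens = [(0.5, 0.9), (0.5, 0.2)]  # illustrative values
    print(holevo_mixed(ens, 0.1), holevo_pure(ens, 0.1))
```

The average output state is the same for both ensembles, so the gap between the two quantities comes entirely from the conditional entropy, which is smallest when every input is pure.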



24.4.2 The Quantum Dynamic Capacity Formula 

We introduce the quantum dynamic capacity formula and show how additivity of it implies 
that the computation of the Pareto optimal trade-off surface of the capacity region requires 
an optimization over a single channel use, rather than an infinite number of them. The 
Pareto optimal trade-off surface consists of all points in the capacity region that are Pareto 
optimal, in the sense that it is not possible to make improvements in one resource without 
offsetting another resource (these are essentially the boundary points of the region in our 
case). We then show how several important capacity formulas discussed previously in this 
book are special cases of the quantum dynamic capacity formula. 

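Definition 24.4.1 (Quantum Dynamic Capacity Formula). The quantum dynamic capacity formula of a quantum channel $\mathcal{N}$ is as follows:

\[
D_{\lambda,\mu}(\mathcal{N}) \equiv \max_{\sigma}\ I(AX;B)_\sigma + \lambda\, I(A\rangle BX)_\sigma + \mu\big(I(X;B)_\sigma + I(A\rangle BX)_\sigma\big), \tag{24.99}
\]

where $\sigma$ is a state of the form in (24.11), $\lambda, \mu \geq 0$, and these parameters $\lambda$ and $\mu$ play the role of Lagrange multipliers.

Definition 24.4.2. The regularized quantum dynamic capacity formula is as follows:

\[
D^{\mathrm{reg}}_{\lambda,\mu}(\mathcal{N}) \equiv \lim_{n \to \infty} \frac{1}{n}\, D_{\lambda,\mu}(\mathcal{N}^{\otimes n}). \tag{24.100}
\]

Lemma 24.4.1. Suppose the quantum dynamic capacity formula is additive for a channel $\mathcal{N}$ and any other arbitrary channel $\mathcal{M}$:

\[
D_{\lambda,\mu}(\mathcal{N} \otimes \mathcal{M}) = D_{\lambda,\mu}(\mathcal{N}) + D_{\lambda,\mu}(\mathcal{M}). \tag{24.101}
\]

Then the regularized quantum dynamic capacity formula for $\mathcal{N}$ is equal to the quantum dynamic capacity formula:

\[
D^{\mathrm{reg}}_{\lambda,\mu}(\mathcal{N}) = D_{\lambda,\mu}(\mathcal{N}). \tag{24.102}
\]

In this sense, the regularized formula "single-letterizes" and it is not necessary to take the limit.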
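We prove the result using induction on $n$. The base case for $n = 1$ is trivial. Suppose the result holds for $n$: $D_{\lambda,\mu}(\mathcal{N}^{\otimes n}) = n D_{\lambda,\mu}(\mathcal{N})$. Then the following chain of equalities proves the inductive step:

\begin{align}
D_{\lambda,\mu}(\mathcal{N}^{\otimes n+1}) &= D_{\lambda,\mu}(\mathcal{N} \otimes \mathcal{N}^{\otimes n}) \tag{24.103} \\
&= D_{\lambda,\mu}(\mathcal{N}) + D_{\lambda,\mu}(\mathcal{N}^{\otimes n}) \tag{24.104} \\
&= D_{\lambda,\mu}(\mathcal{N}) + n D_{\lambda,\mu}(\mathcal{N}). \tag{24.105}
\end{align}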

The first equality follows by expanding the tensor product. The second critical equality 
follows from the assumption that the formula is additive. The final equality follows from the 
induction hypothesis. 
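For completeness, the remaining step from the induction to the statement of the lemma is short. A sketch of it, using only the definition of the regularized formula and the identity $D_{\lambda,\mu}(\mathcal{N}^{\otimes n}) = n D_{\lambda,\mu}(\mathcal{N})$ established by the induction:

```latex
\begin{align*}
D^{\mathrm{reg}}_{\lambda,\mu}(\mathcal{N})
  &= \lim_{n \to \infty} \frac{1}{n}\, D_{\lambda,\mu}(\mathcal{N}^{\otimes n}) \\
  &= \lim_{n \to \infty} \frac{1}{n}\, n\, D_{\lambda,\mu}(\mathcal{N}) \\
  &= D_{\lambda,\mu}(\mathcal{N}).
\end{align*}
```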

Theorem 24.4.1. Single-letterization of the quantum dynamic capacity formula implies that the computation of the Pareto optimal trade-off surface of the dynamic capacity region requires an optimization over a single channel use.

We employ ideas from optimization theory for the proof (see Ref. [45]). We would like to characterize all the points in the capacity region that are Pareto optimal. Such a task is standard vector optimization in the theory of Pareto trade-off analysis (see Section 4.7 of Ref. [45]). We can phrase the optimization task as the following scalarization of the vector optimization task:

\[
\max_{C, Q, E, p(x), \phi_x}\ w_C C + w_Q Q + w_E E \tag{24.106}
\]

subject to

\begin{align}
C + 2Q &\leq I(AX;B^n)_\sigma, \tag{24.107} \\
Q + E &\leq I(A\rangle B^n X)_\sigma, \tag{24.108} \\
C + Q + E &\leq I(X;B^n)_\sigma + I(A\rangle B^n X)_\sigma, \tag{24.109}
\end{align}

where the maximization is over all $C$, $Q$, and $E$ and over probability distributions $p_X(x)$ and bipartite states $\phi_x^{AA'^n}$. The geometric interpretation of the scalarization task is that we are trying to find a supporting plane of the dynamic capacity region where the weight vector $(w_C, w_Q, w_E)$ is the normal vector of the plane and the value of its inner product with $(C, Q, E)$ characterizes the offset of the plane.

The Lagrangian of the above optimization problem is 

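\begin{align}
\mathcal{L}\big(C,Q,E,p_X(x),\phi_x^{AA'^n},\lambda_1,\lambda_2,\lambda_3\big) \equiv{}
& w_C C + w_Q Q + w_E E + \lambda_1\big(I(AX;B^n)_\sigma - (C+2Q)\big) \nonumber\\
& + \lambda_2\big(I(A\rangle B^n X)_\sigma - (Q+E)\big) \nonumber\\
& + \lambda_3\big(I(X;B^n)_\sigma + I(A\rangle B^n X)_\sigma - (C+Q+E)\big), \tag{24.110}
\end{align}

and the Lagrange dual function $g$ [45] is

\begin{align}
g(\lambda_1,\lambda_2,\lambda_3) \equiv \sup_{C,Q,E,p(x),\phi_x^{AA'^n}}
\mathcal{L}\big(C,Q,E,p_X(x),\phi_x^{AA'^n},\lambda_1,\lambda_2,\lambda_3\big), \tag{24.111}
\end{align}

where $\lambda_1, \lambda_2, \lambda_3 \geq 0$. The optimization task simplifies if the Lagrange dual function does.
Thus, we rewrite the Lagrange dual function as follows:

\begin{align}
g(\lambda_1,\lambda_2,\lambda_3)
&= \sup_{C,Q,E,p(x),\phi_x^{AA'^n}} w_C C + w_Q Q + w_E E + \lambda_1\big(I(AX;B^n)_\sigma - (C+2Q)\big) \tag{24.112}\\
&\qquad + \lambda_2\big(I(A\rangle B^n X)_\sigma - (Q+E)\big)
 + \lambda_3\big(I(X;B^n)_\sigma + I(A\rangle B^n X)_\sigma - (C+Q+E)\big) \tag{24.113}\\
&= \sup_{C,Q,E,p(x),\phi_x^{AA'^n}} (w_C - \lambda_1 - \lambda_3)C + (w_Q - 2\lambda_1 - \lambda_2 - \lambda_3)Q + (w_E - \lambda_2 - \lambda_3)E \nonumber\\
&\qquad + \lambda_1\Big(I(AX;B^n)_\sigma + \frac{\lambda_2}{\lambda_1} I(A\rangle B^n X)_\sigma + \frac{\lambda_3}{\lambda_1}\big(I(X;B^n)_\sigma + I(A\rangle B^n X)_\sigma\big)\Big) \tag{24.114}\\
&= \sup_{C,Q,E} (w_C - \lambda_1 - \lambda_3)C + (w_Q - 2\lambda_1 - \lambda_2 - \lambda_3)Q + (w_E - \lambda_2 - \lambda_3)E \nonumber\\
&\qquad + \lambda_1\Big(\max_{p(x),\phi_x^{AA'^n}} I(AX;B^n)_\sigma + \frac{\lambda_2}{\lambda_1} I(A\rangle B^n X)_\sigma + \frac{\lambda_3}{\lambda_1}\big(I(X;B^n)_\sigma + I(A\rangle B^n X)_\sigma\big)\Big). \tag{24.115}
\end{align}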

The first equality follows by definition. The second equality follows from some algebra, and
the last follows because the Lagrange dual function factors into two separate optimization
tasks: one over $C$, $Q$, and $E$ and another that is equivalent to the quantum dynamic capacity
formula with $\lambda = \lambda_2/\lambda_1$ and $\mu = \lambda_3/\lambda_1$. Thus, the computation of the Pareto optimal
trade-off surface requires just a single use of the channel if the quantum dynamic capacity formula
in (24.99) single-letterizes.
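
The bookkeeping in the rewritten dual function can be made concrete with a small sketch. It rests on an assumption not spelled out in the text: if $C$, $Q$, and $E$ range over all real numbers, the supremum over them is finite only when their three coefficients vanish, which ties the multipliers to the weight vector. The function names `multipliers` and `formula_parameters` are hypothetical and introduced only for illustration.

```python
from fractions import Fraction


def multipliers(w_c, w_q, w_e):
    """Solve the (assumed) stationarity conditions for the multipliers:
         w_C = l1 + l3,   w_Q = 2*l1 + l2 + l3,   w_E = l2 + l3.
    Subtracting the first and third equations from the second gives
    l1 - l3 = w_Q - w_C - w_E, and adding w_C = l1 + l3 yields l1.
    """
    l1 = (Fraction(w_q) - Fraction(w_e)) / 2
    l3 = Fraction(w_c) - l1
    l2 = Fraction(w_e) - l3
    return (l1, l2, l3)


def formula_parameters(w_c, w_q, w_e):
    """Return (lambda, mu) = (l2/l1, l3/l1), or None when l1 = 0.

    The case l1 = 0 corresponds to a limit in which lambda or mu is
    unbounded rather than to a finite parameter pair.
    """
    l1, l2, l3 = multipliers(w_c, w_q, w_e)
    if l1 == 0:
        return None
    return (l2 / l1, l3 / l1)


if __name__ == "__main__":
    for w in [(1, 2, 0), (0, 1, 1), (1, 1, 1)]:
        print(w, multipliers(*w), formula_parameters(*w))
```

Under this assumption, the weight vector $(1, 2, 0)$ gives $\lambda_2 = \lambda_3 = 0$, so that $\lambda = \mu = 0$, while $(0, 1, 1)$ and $(1, 1, 1)$ give $\lambda_1 = 0$, which matches limits in which $\lambda$ or $\mu$ grows without bound.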



Special cases of the quantum dynamic capacity formula

We now show how several capacity formulas of a quantum channel, including the
entanglement-assisted classical capacity (Theorem 20.3.1), the quantum capacity formula (Theorem 23.3.1),
and the classical capacity formula (Theorem 19.3.1), are special cases of the quantum dynamic
capacity formula.

We first give a geometric interpretation of these special cases before proceeding to the
proofs. Recall that the dynamic capacity region has the simple interpretation as a translation
of the three-faced unit resource capacity region along the trade-off curve for
entanglement-assisted classical and quantum communication (see Figure 24.4 for the example of the region
of the dephasing channel). Any particular weight vector $(w_C, w_Q, w_E)$ in (24.106) gives
a set of parallel planes that slice through the $(C, Q, E)$ space, and the goal of the scalar
optimization task is to find one of these planes that is a supporting plane, intersecting
a point (or a set of points) on the trade-off surface of the dynamic capacity region. We
consider three special planes:

1. The first corresponds to the plane containing the vectors of super-dense coding and
teleportation. The normal vector of this plane is $(1, 2, 0)$, and suppose that we set the
weight vector in (24.106) to be this vector. Then the optimization program finds the
set of points on the trade-off surface such that a plane with this normal vector is a
supporting plane for the region. The optimization program singles out (24.107), and
we can think of this as being equivalent to setting $\lambda_2, \lambda_3 = 0$ in the Lagrange dual
function. We show below that the optimization program becomes equivalent to finding
the entanglement-assisted capacity (Theorem 20.3.1), in the sense that the quantum
dynamic capacity formula becomes the entanglement-assisted capacity formula.

2. The next plane contains the vectors of teleportation and entanglement distribution.
The normal vector of this plane is $(0, 1, 1)$. Setting the weight vector in (24.106) to
be this vector makes the optimization program single out (24.108), and we can think
of this as being equivalent to setting $\lambda_1, \lambda_3 = 0$ in the Lagrange dual function. We
show below that the optimization program becomes equivalent to finding the quantum
capacity (Theorem 23.3.1), in the sense that the quantum dynamic capacity formula
becomes the LSD formula for the quantum capacity.

3. A third plane contains the vectors of super-dense coding and entanglement distribution.
The normal vector of this plane is $(1, 1, 1)$. Setting the weight vector in (24.106) to
be this vector makes the optimization program single out (24.109), and we can think
of this as being equivalent to setting $\lambda_1, \lambda_2 = 0$ in the Lagrange dual function. We
show below that the optimization becomes equivalent to finding the classical capacity
(Theorem 19.3.1), in the sense that the quantum dynamic capacity formula becomes
the HSW formula for the classical capacity.
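
The three normal vectors quoted in the list can be checked with a cross product. The sketch below assumes the usual $(C, Q, E)$ rate vectors for the unit protocols, with generated resources counted positively and consumed resources negatively: teleportation $(-2, 1, -1)$, super-dense coding $(2, -1, -1)$, and entanglement distribution $(0, -1, 1)$. These vectors and the helper names `cross` and `primitive` are assumptions made for illustration.

```python
from math import gcd

# Assumed (C, Q, E) rate vectors of the unit protocols:
# positive entries are generated resources, negative entries are consumed.
TP = (-2, 1, -1)   # teleportation: 2 cbits + 1 ebit -> 1 qubit
SD = (2, -1, -1)   # super-dense coding: 1 qubit + 1 ebit -> 2 cbits
ED = (0, -1, 1)    # entanglement distribution: 1 qubit -> 1 ebit


def cross(u, v):
    """Cross product of two 3-vectors; it is normal to the plane they span."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def primitive(v):
    """Rescale a nonzero integer vector to coprime entries, first nonzero entry positive."""
    g = gcd(gcd(abs(v[0]), abs(v[1])), abs(v[2]))
    w = tuple(x // g for x in v)
    first = next(x for x in w if x != 0)
    return w if first > 0 else tuple(-x for x in w)


if __name__ == "__main__":
    print("SD, TP plane normal:", primitive(cross(SD, TP)))
    print("TP, ED plane normal:", primitive(cross(TP, ED)))
    print("SD, ED plane normal:", primitive(cross(SD, ED)))
```

A normal vector is only defined up to scale, so the sketch reduces each cross product to its smallest integer representative before comparing it with the vectors in the list.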

Corollary 24.4.1. The quantum dynamic capacity formula is equivalent to the
entanglement-assisted classical capacity formula when $\lambda, \mu = 0$, in the sense that

\begin{align}
\max_{\sigma} I(AX;B)_\sigma = \max_{\phi^{AA'}} I(A;B). \tag{24.116}
\end{align}

Proof. The inequality $\max_{\sigma} I(AX;B) \geq \max_{\phi^{AA'}} I(A;B)$ follows because the state $\sigma$ is of
the form in (24.11) and we can always choose $p_X(x) = \delta_{x,x_0}$ and $\phi_{x_0}^{AA'}$ to be the state that
maximizes $I(A;B)$. We now show the other inequality $\max_{\sigma} I(AX;B) \leq \max_{\phi^{AA'}} I(A;B)$.
First, consider that the following chain of equalities holds for any state $\phi^{ABE}$ resulting from
the isometric extension of the channel:

\begin{align}
I(A;B) &= H(B) + H(A) - H(AB) \nonumber\\
&= H(B) + H(BE) - H(E) \tag{24.117}\\
&= H(B) + H(B|E). \tag{24.118}
\end{align}

In this way, we see that the mutual information is purely a function of the channel input
density operator $\mathrm{Tr}_A\{\phi^{AA'}\}$. Then consider any state $\sigma$ of the form in (24.11). The following
chain of inequalities holds

\begin{align}
I(AX;B)_\sigma &= H(A|X)_\sigma + H(B)_\sigma - H(E|X)_\sigma \tag{24.119}\\
&= H(BE|X)_\sigma + H(B)_\sigma - H(E|X)_\sigma \tag{24.120}\\
&= H(B|EX)_\sigma + H(B)_\sigma \tag{24.121}\\
&\leq H(B|E)_\sigma + H(B)_\sigma \tag{24.122}\\
&\leq \max_{\phi^{AA'}} I(A;B). \tag{24.123}
\end{align}

The first equality follows by expanding the mutual information. The second equality follows 
because the state on ABE is pure when conditioned on X. The third equality follows from 
the entropy chain rule. The first inequality follows from strong subadditivity, and the last 
follows because the state after tracing out systems X and A is a particular state that arises 
from the channel and cannot be larger than the maximum. □ 

Corollary 24.4.2. The quantum dynamic capacity formula is equivalent to the LSD quantum
capacity formula in the limit where $\lambda \to \infty$ and $\mu$ is fixed, in the sense that

\begin{align}
\max_{\sigma} I(A\rangle BX)_\sigma = \max_{\phi^{AA'}} I(A\rangle B). \tag{24.124}
\end{align}

Proof. The inequality $\max_{\sigma} I(A\rangle BX) \geq \max_{\phi^{AA'}} I(A\rangle B)$ follows because the state $\sigma$ is of
the form in (24.11) and we can always choose $p_X(x) = \delta_{x,x_0}$ and $\phi_{x_0}^{AA'}$ to be the state
that maximizes $I(A\rangle B)$. The inequality $\max_{\sigma} I(A\rangle BX) \leq \max_{\phi^{AA'}} I(A\rangle B)$ follows because
$I(A\rangle BX) = \sum_x p_X(x) I(A\rangle B)_{\phi_x}$, and the maximum is never less than the average. □

Corollary 24.4.3. The quantum dynamic capacity formula is equivalent to the HSW classical
capacity formula in the limit where $\mu \to \infty$ and $\lambda$ is fixed, in the sense that

\begin{align}
\max_{\sigma} \big[ I(A\rangle BX)_\sigma + I(X;B)_\sigma \big] = \max_{\{p_X(x),\psi_x\}} I(X;B). \tag{24.125}
\end{align}

Proof. The inequality $\max_{\sigma} I(A\rangle BX)_\sigma + I(X;B)_\sigma \geq \max_{\{p_X(x),\psi_x\}} I(X;B)$ follows by choosing
$\sigma$ to be the pure ensemble that maximizes $I(X;B)$ and noting that $I(A\rangle BX)_\sigma$ vanishes
for a pure ensemble. We now prove the inequality $\max_{\sigma} I(A\rangle BX)_\sigma + I(X;B)_\sigma \leq
\max_{\{p_X(x),\psi_x\}} I(X;B)$. Consider a state $\omega^{XYBE}$ obtained by performing a von Neumann
measurement on the $A$ system of the state $\sigma^{XABE}$. Then

\begin{align}
I(A\rangle BX)_\sigma + I(X;B)_\sigma &= H(B)_\sigma - H(E|X)_\sigma \tag{24.126}\\
&= H(B)_\omega - H(E|X)_\omega \tag{24.127}\\
&\leq H(B)_\omega - H(E|XY)_\omega \tag{24.128}\\
&= H(B)_\omega - H(B|XY)_\omega \tag{24.129}\\
&= I(XY;B)_\omega \tag{24.130}\\
&\leq \max_{\{p_X(x),\psi_x\}} I(X;B). \tag{24.131}
\end{align}

The first equality follows by expanding the conditional coherent information and the Holevo
information. The second equality follows because the measured $A$ system is not involved in
the entropies. The first inequality follows because conditioning does not increase entropy.
The third equality follows because the state $\omega$ is pure when conditioned on $X$ and $Y$. The
fourth equality follows by definition, and the last inequality follows because the joint classical
variable $(X, Y)$ of $\omega$ labels one particular ensemble of pure states, whose Holevo information
cannot exceed the maximum over all such ensembles. □



24.5 Examples of Channels 



In this final section, we prove that a broad class of channels, known as the Hadamard channels
(see Section 5.2.4), have a single-letter dynamic capacity region. We prove this result by
analyzing the quantum dynamic capacity formula for this class of channels. A dephasing
channel is a special case of a Hadamard channel, and so we can compute its dynamic capacity
region.

We also overview the dynamic capacity region of a lossy bosonic channel, which is a good
model for free-space communication or loss in an optical fiber. However, we state only the
main results and do not get into too many details of this channel (doing so requires the
theory of quantum optics and infinite-dimensional Hilbert spaces, which is beyond the scope
of this book). The upshot for this channel is that trade-off coding can give remarkable gains
over time-sharing.

24.5.1 Quantum Hadamard channels

Below we show that the regularization in (24.6) is not necessary if the quantum channel is a
Hadamard channel. This result holds because a Hadamard channel has a special structure
(see Section 5.2.4).



Theorem 24.5.1. The dynamic capacity region $\mathcal{C}_{\mathrm{CQE}}(\mathcal{N}_H)$ of a quantum Hadamard channel
$\mathcal{N}_H$ is equal to its one-shot region $\mathcal{C}^{(1)}_{\mathrm{CQE}}(\mathcal{N}_H)$.

The proof of the above theorem has two parts: (1) the lemma below shows that the
quantum dynamic capacity formula is additive when one of the channels is Hadamard, and
(2) the induction argument in Lemma 24.4.1 then proves single-letterization.

Lemma 24.5.1. The following additivity relation holds for a Hadamard channel $\mathcal{N}_H$ and
any other channel $\mathcal{N}$:

\begin{align}
D_{\lambda,\mu}(\mathcal{N}_H \otimes \mathcal{N}) = D_{\lambda,\mu}(\mathcal{N}_H) + D_{\lambda,\mu}(\mathcal{N}). \tag{24.132}
\end{align}

We first note that the inequality $D_{\lambda,\mu}(\mathcal{N}_H \otimes \mathcal{N}) \geq D_{\lambda,\mu}(\mathcal{N}_H) + D_{\lambda,\mu}(\mathcal{N})$ holds for any
two channels simply by selecting the state $\sigma$ in the maximization to be a tensor product of
the ones that individually maximize $D_{\lambda,\mu}(\mathcal{N}_H)$ and $D_{\lambda,\mu}(\mathcal{N})$.

So we prove that the non-trivial inequality $D_{\lambda,\mu}(\mathcal{N}_H \otimes \mathcal{N}) \leq D_{\lambda,\mu}(\mathcal{N}_H) + D_{\lambda,\mu}(\mathcal{N})$ holds
when the first channel is a Hadamard channel. Since the first channel is Hadamard, it is
degradable and its degrading map has a particular structure: there are maps $\mathcal{D}_1^{B_1 \to Y}$ and
$\mathcal{D}_2^{Y \to E_1}$, where $Y$ is a classical register, such that the degrading map is $\mathcal{D}_2^{Y \to E_1} \circ \mathcal{D}_1^{B_1 \to Y}$.
Suppose the state we are considering to input to the tensor product channel is

\begin{align}
\rho^{XAA_1'A_2'} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes \phi_x^{AA_1'A_2'}, \tag{24.133}
\end{align}

and this state is the one that maximizes $D_{\lambda,\mu}(\mathcal{N}_H \otimes \mathcal{N})$. Suppose that the output of the first
channel is

\begin{align}
\theta^{XAB_1E_1A_2'} \equiv U_{\mathcal{N}_H}^{A_1' \to B_1E_1}\big(\rho^{XAA_1'A_2'}\big), \tag{24.134}
\end{align}

and the output of the second channel is

\begin{align}
\omega^{XAB_1E_1B_2E_2} \equiv U_{\mathcal{N}}^{A_2' \to B_2E_2}\big(\theta^{XAB_1E_1A_2'}\big). \tag{24.135}
\end{align}

Finally, we define the following state as the result of applying the first part of the Hadamard
degrading map (a von Neumann measurement) to $\omega$:

\begin{align}
\sigma^{XYAE_1B_2E_2} \equiv \mathcal{D}_1^{B_1 \to Y}\big(\omega^{XAB_1E_1B_2E_2}\big). \tag{24.136}
\end{align}



In particular, the state $\sigma$ on systems $AE_1B_2E_2$ is pure when conditioned on $X$ and $Y$. Then
the following chain of inequalities holds

\begin{align}
D_{\lambda,\mu}(\mathcal{N}_H \otimes \mathcal{N})
&= I(AX;B_1B_2)_\omega + \lambda I(A\rangle B_1B_2X)_\omega + \mu\big(I(X;B_1B_2)_\omega + I(A\rangle B_1B_2X)_\omega\big) \tag{24.137}\\
&= H(B_1B_2E_1E_2|X)_\omega + \lambda H(B_1B_2|X)_\omega + (\mu+1)H(B_1B_2)_\omega \nonumber\\
&\qquad - (\lambda+\mu+1)H(E_1E_2|X)_\omega \tag{24.138}\\
&= H(B_1E_1|X)_\omega + \lambda H(B_1|X)_\omega + (\mu+1)H(B_1)_\omega - (\lambda+\mu+1)H(E_1|X)_\omega \nonumber\\
&\qquad + H(B_2E_2|B_1E_1X)_\omega + \lambda H(B_2|B_1X)_\omega + (\mu+1)H(B_2|B_1)_\omega \nonumber\\
&\qquad - (\lambda+\mu+1)H(E_2|E_1X)_\omega \tag{24.139}\\
&\leq H(B_1E_1|X)_\theta + \lambda H(B_1|X)_\theta + (\mu+1)H(B_1)_\theta - (\lambda+\mu+1)H(E_1|X)_\theta \nonumber\\
&\qquad + H(B_2E_2|YX)_\sigma + \lambda H(B_2|YX)_\sigma + (\mu+1)H(B_2)_\sigma - (\lambda+\mu+1)H(E_2|YX)_\sigma \tag{24.140}\\
&= I(AA_2'X;B_1)_\theta + \lambda I(AA_2'\rangle B_1X)_\theta + \mu\big(I(X;B_1)_\theta + I(AA_2'\rangle B_1X)_\theta\big) \nonumber\\
&\qquad + I(AE_1YX;B_2)_\sigma + \lambda I(AE_1\rangle B_2YX)_\sigma + \mu\big(I(YX;B_2)_\sigma + I(AE_1\rangle B_2YX)_\sigma\big) \tag{24.141}\\
&\leq D_{\lambda,\mu}(\mathcal{N}_H) + D_{\lambda,\mu}(\mathcal{N}). \tag{24.142}
\end{align}

The first equality follows by evaluating the quantum dynamic capacity formula $D_{\lambda,\mu}(\mathcal{N}_H \otimes \mathcal{N})$
on the state $\rho$. The next two equalities follow by rearranging entropies and because the state
$\omega$ on systems $AB_1E_1B_2E_2$ is pure when conditioned on $X$. The inequality in the middle is the
crucial one and follows from the Hadamard structure of the channel: we exploit monotonicity
of conditional entropy under quantum operations so that $H(B_2|B_1X)_\omega \leq H(B_2|YX)_\sigma$,
$H(B_2E_2|B_1E_1X)_\omega \leq H(B_2E_2|YX)_\sigma$, and $H(E_2|YX)_\sigma \leq H(E_2|E_1X)_\omega$. It also follows
because $H(B_2|B_1)_\omega \leq H(B_2)_\omega$. The next equality follows by rearranging entropies and the
final inequality follows because $\theta$ is a state of the form (24.11) for the first channel while $\sigma$
is a state of the form (24.11) for the second channel.

[Figure 24.3 plot; recovered axis labels: quantum communication rate, classical communication rate.]

Figure 24.3: A plot of the dynamic capacity region for a qubit dephasing channel with dephasing parameter
p = 0.2. The plot shows that the CEF trade-off curve (the protocol from Corollary 21.5.2) lies along the
boundary of the dynamic capacity region. The rest of the region is simply the combination of the CEF points
with the unit protocols teleportation (TP), super-dense coding (SD), and entanglement distribution (ED).



24.5.2 The Dephasing Channel 

The below theorem shows that the full dynamic capacity region admits a particularly simple
form when the noisy quantum channel is a qubit dephasing channel $\Delta_p$, where

\begin{align}
\Delta_p(\rho) &\equiv (1-p)\rho + p\Delta(\rho), \tag{24.143}\\
\Delta(\rho) &\equiv \langle 0|\rho|0\rangle |0\rangle\langle 0| + \langle 1|\rho|1\rangle |1\rangle\langle 1|. \tag{24.144}
\end{align}

Figure 24.3 plots this region for the case of a dephasing channel with dephasing parameter
p = 0.2. Figure 24.4 plots special two-dimensional cases of the full region for various values
of the dephasing parameter p. The figure demonstrates that trade-off coding just barely
beats time-sharing.

Theorem 24.5.2. The dynamic capacity region $\mathcal{C}_{\mathrm{CQE}}(\Delta_p)$ of a dephasing channel with
dephasing parameter $p$ is the set of all $C$, $Q$, and $E$ such that

\begin{align}
C + 2Q &\leq 1 + H_2(\nu) - H_2(\gamma(\nu,p)), \tag{24.145}\\
Q + E &\leq H_2(\nu) - H_2(\gamma(\nu,p)), \tag{24.146}\\
C + Q + E &\leq 1 - H_2(\gamma(\nu,p)), \tag{24.147}
\end{align}

where $\nu \in [0,1/2]$, $H_2$ is the binary entropy function, and

\begin{align}
\gamma(\nu,p) \equiv \frac{1}{2} + \frac{1}{2}\sqrt{1 - 16 \cdot \frac{p}{2}\left(1 - \frac{p}{2}\right)\nu(1-\nu)}. \tag{24.148}
\end{align}

Figure 24.4: Plot of (a) the CQ trade-off curve and (b) the CE trade-off curve for a p-dephasing qubit
channel for p = 0, 0.1, 0.2, . . . , 0.9, 1. The trade-off curves for p = 0 correspond to those of a noiseless qubit
channel and are the rightmost trade-off curve in each plot. The trade-off curves for p = 1 correspond to
those for a classical channel, and are the leftmost trade-off curves in each plot. Each trade-off curve between
these two extremes beats a time-sharing strategy, but these two extremes do not beat time-sharing.
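
The bounds in Theorem 24.5.2 are easy to evaluate numerically. The following is a minimal sketch, assuming the form $\gamma(\nu,p) = \tfrac{1}{2} + \tfrac{1}{2}\sqrt{1 - 16 \cdot \tfrac{p}{2}(1-\tfrac{p}{2})\nu(1-\nu)}$ for the function in (24.148); the function names `h2`, `gamma`, and `dephasing_bounds` are hypothetical and chosen only for illustration.

```python
import math


def h2(x):
    """Binary entropy in bits, with the convention 0 * log2(0) = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)


def gamma(nu, p):
    """Assumed form of gamma(nu, p) from (24.148)."""
    inner = 1.0 - 16.0 * (p / 2.0) * (1.0 - p / 2.0) * nu * (1.0 - nu)
    # Guard against tiny negative values caused by floating-point rounding.
    return 0.5 + 0.5 * math.sqrt(max(inner, 0.0))


def dephasing_bounds(nu, p):
    """Right-hand sides of (24.145)-(24.147) for a given nu in [0, 1/2]."""
    env = h2(gamma(nu, p))
    return (1.0 + h2(nu) - env,   # bound on C + 2Q
            h2(nu) - env,         # bound on Q + E
            1.0 - env)            # bound on C + Q + E


if __name__ == "__main__":
    for p in (0.0, 0.2, 1.0):
        print(p, [dephasing_bounds(nu, p) for nu in (0.1, 0.3, 0.5)])
```

Two sanity checks follow from the formulas alone: at $p = 0$ and $\nu = 1/2$ the bounds reduce to $(2, 1, 1)$, the region of a noiseless qubit channel, and at $p = 1$ the bound on $Q + E$ vanishes for every $\nu$, as expected for a classical channel.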



We first notice that it suffices to consider an ensemble of pure states whose reductions to
$A'$ are diagonal in the dephasing basis (see the following exercise).

Exercise 24.5.1 Prove that the following properties hold for a generalized dephasing channel
$\mathcal{N}_D$, its complement $\mathcal{N}_D^c$, the completely dephasing channel $\Delta$, and all input states $\rho$:

\begin{align}
\mathcal{N}_D(\Delta(\rho)) &= \Delta(\mathcal{N}_D(\rho)), \tag{24.149}\\
\mathcal{N}_D^c(\Delta(\rho)) &= \mathcal{N}_D^c(\rho). \tag{24.150}
\end{align}

Conclude that

\begin{align}
H(\rho) &\leq H(\Delta(\rho)), \tag{24.151}\\
H(\mathcal{N}_D(\rho)) &\leq H(\Delta(\mathcal{N}_D(\rho))) = H(\mathcal{N}_D(\Delta(\rho))), \tag{24.152}\\
H(\mathcal{N}_D^c(\rho)) &= H(\mathcal{N}_D^c(\Delta(\rho))), \tag{24.153}
\end{align}

so that it suffices to consider diagonal input states for the dephasing channel.



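Next we prove below that it is sufficient to consider an ensemble of the following form to
characterize the boundary points of the region:

\begin{align}
\frac{1}{2}\left(|0\rangle\langle 0|^X \otimes \psi_0^{AA'} + |1\rangle\langle 1|^X \otimes \psi_1^{AA'}\right), \tag{24.154}
\end{align}

where $\psi_0^{AA'}$ and $\psi_1^{AA'}$ are pure states, defined as follows for $\nu \in [0, 1/2]$:

\begin{align}
\mathrm{Tr}_A\{\psi_0^{AA'}\} &= \nu |0\rangle\langle 0|^{A'} + (1-\nu)|1\rangle\langle 1|^{A'}, \tag{24.155}\\
\mathrm{Tr}_A\{\psi_1^{AA'}\} &= (1-\nu)|0\rangle\langle 0|^{A'} + \nu |1\rangle\langle 1|^{A'}. \tag{24.156}
\end{align}

We now prove the above claim. We assume without loss of generality that the dephasing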
basis is the computational basis. Consider a classical-quantum state with a finite number $N$
of conditional density operators $\phi_x^{AA'}$ whose reduction to $A'$ is diagonal:

\begin{align}
\rho^{XAA'} \equiv \sum_{x=0}^{N-1} p_X(x) |x\rangle\langle x|^X \otimes \phi_x^{AA'}. \tag{24.157}
\end{align}

We can form a new classical-quantum state with double the number of conditional density
operators by "bit-flipping" the original conditional density operators:

\begin{align}
\sigma^{XAA'} \equiv \frac{1}{2}\sum_{x=0}^{N-1} p_X(x)\left(|x\rangle\langle x|^X \otimes \phi_x^{AA'} + |x+N\rangle\langle x+N|^X \otimes X^{A'}\phi_x^{AA'}X^{A'}\right), \tag{24.158}
\end{align}



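where $X$ is the $\sigma_X$ "bit-flip" Pauli operator. Consider the following chain of inequalities that
holds for all $\lambda, \mu \geq 0$:

\begin{align}
& I(AX;B)_\rho + \lambda I(A\rangle BX)_\rho + \mu\big(I(X;B)_\rho + I(A\rangle BX)_\rho\big) \nonumber\\
&= H(A|X)_\rho + (\mu+1)H(B)_\rho + \lambda H(B|X)_\rho - (\lambda+\mu+1)H(E|X)_\rho \tag{24.159}\\
&\leq (\mu+1)H(B)_\sigma + H(A|X)_\sigma + \lambda H(B|X)_\sigma - (\lambda+\mu+1)H(E|X)_\sigma \tag{24.160}\\
&= (\mu+1) + H(A|X)_\sigma + \lambda H(B|X)_\sigma - (\lambda+\mu+1)H(E|X)_\sigma \tag{24.161}\\
&= (\mu+1) + \sum_x p_X(x)\big[H(A)_{\phi_x} + \lambda H(B)_{\phi_x} - (\lambda+\mu+1)H(E)_{\phi_x}\big] \tag{24.162}\\
&\leq (\mu+1) + \max_x \big[H(A)_{\phi_x} + \lambda H(B)_{\phi_x} - (\lambda+\mu+1)H(E)_{\phi_x}\big] \tag{24.163}\\
&= (\mu+1) + H(A)_{\phi_x^*} + \lambda H(B)_{\phi_x^*} - (\lambda+\mu+1)H(E)_{\phi_x^*}. \tag{24.164}
\end{align}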



The first equality follows by standard entropic manipulations. The first inequality follows
from two observations. First, the conditional entropies take the same values on $\rho$ and $\sigma$:
the conditional entropy $H(B|X)$ is invariant under a bit-flipping unitary on the
input state because the bit flip commutes with the channel, so that $H(B)_{X\rho_x^B X} = H(B)_{\rho_x^B}$, and a bit flip
on the input state does not change the eigenvalues for the output of the dephasing channel's
complementary channel:

\begin{align}
H(E)_{\mathcal{N}^c(X\rho_x^{A'}X)} = H(E)_{\mathcal{N}^c(\rho_x^{A'})}. \tag{24.165}
\end{align}

Second, $H(B)_\rho \leq H(B)_\sigma$ because entropy is concave, i.e., the local state $\sigma^B$ is a mixed
version of $\rho^B$. The second equality follows because

\begin{align}
H(B)_{\sigma^B} = H\!\left(\sum_x \frac{1}{2}p_X(x)\big(\rho_x^B + X\rho_x^B X\big)\right) = H\!\left(\frac{1}{2}\sum_x p_X(x)\, I\right) = 1. \tag{24.166}
\end{align}

The third equality follows because the system $X$ is classical. The second inequality follows
because the maximum value of a realization of a random variable is not less than its
expectation. The final equality simply follows by defining $\phi_x^*$ to be the conditional density
operator on systems $A$, $B$, and $E$ that arises from sending through the channel a state whose
reduction to $A'$ is of the form $\nu|0\rangle\langle 0| + (1-\nu)|1\rangle\langle 1|$. Thus, an ensemble of the kind in
(24.154) is sufficient to attain a point on the boundary of the region.

Evaluating the entropic quantities in Theorem 24.2.1 on a state of the above form then
gives the expression for the region in Theorem 24.5.2.



24.5.3 The Lossy Bosonic Channel 

One of the most important practical channels in quantum communication is known as the
lossy bosonic channel. This channel can model the communication of photons through
free space or over a fiber optic cable because the main source of noise in these settings
is just the loss of photons. The lossy bosonic channel has one parameter $\eta \in [0, 1]$ that
characterizes the fraction of photons that make it through the channel to the receiver on
average. The environment Eve is able to collect all of the photons that do not make it to
the receiver; this fraction is $1 - \eta$. Usually, we also restrict the mean number of photons
that the sender is allowed to send through the channel (if we do not do so, then there could
be an infinite amount of energy available, which is unphysical from a practical perspective,
and furthermore, some of the capacities become infinite, which is less interesting from a
theoretical perspective). So, we let $N_S$ be the mean number of photons available at the
transmitter. Capacities of this channel are then a function of these two parameters $\eta$ and $N_S$.

Exercise 24.5.2 Prove that the quantum capacity of a lossy bosonic channel vanishes when
$\eta = 1/2$.

In this section, we show how trade-off coding for this channel can give a remarkable gain
over time-sharing. Trade-off coding for this channel amounts to a power-sharing strategy,
in which the sender dedicates a fraction $\lambda$ of the available photons to the quantum part
of the code and the other fraction $1 - \lambda$ to the classical part of the code. This
power-sharing strategy is provably optimal (up to a long-standing conjecture) and can beat
time-sharing by significant margins (much more so than the dephasing channel does, for example).
Specifically, recall that a trade-off coding strategy has the sender and receiver generate
random codes from an ensemble of the following form:

\begin{align}
\{p_X(x), |\phi_x\rangle^{AA'}\}, \tag{24.167}
\end{align}

where $p_X(x)$ is some distribution and the states $|\phi_x\rangle^{AA'}$ are correlated with this distribution,
with Alice feeding system $A'$ into the channel. For the lossy bosonic channel, it turns out
that the best ensemble to choose is of the following form:

\begin{align}
\{p_{(1-\lambda)N_S}(\alpha), D^{A'}(\alpha)|\psi_{\mathrm{TMS}}\rangle^{AA'}\}, \tag{24.168}
\end{align}

where $\alpha$ is a complex variable. The distribution $p_{(1-\lambda)N_S}(\alpha)$ is an isotropic Gaussian
distribution with variance $(1-\lambda)N_S$:

\begin{align}
p_{(1-\lambda)N_S}(\alpha) \equiv \frac{1}{\pi(1-\lambda)N_S} \exp\{-|\alpha|^2/[(1-\lambda)N_S]\}, \tag{24.169}
\end{align}

where $\lambda \in [0, 1]$ is the power-sharing or photon-number-sharing parameter, indicating how
many photons to dedicate to the quantum part of the code, while $1 - \lambda$ indicates how many
photons to dedicate to the classical part. In (24.168), $D^{A'}(\alpha)$ is a "displacement" unitary
operator acting on system $A'$ (more on this below), and $|\psi_{\mathrm{TMS}}\rangle^{AA'}$ is a "two-mode squeezed"
(TMS) state of the following form:

\begin{align}
|\psi_{\mathrm{TMS}}\rangle^{AA'} \equiv \sum_{n=0}^{\infty} \sqrt{\frac{[\lambda N_S]^n}{[\lambda N_S + 1]^{n+1}}}\, |n\rangle^A |n\rangle^{A'}. \tag{24.170}
\end{align}

Let $\theta$ denote the state resulting from tracing over the mode $A$:

\begin{align}
\theta &\equiv \mathrm{Tr}_A\{|\psi_{\mathrm{TMS}}\rangle\langle\psi_{\mathrm{TMS}}|^{AA'}\} \tag{24.171}\\
&= \sum_{n=0}^{\infty} \frac{[\lambda N_S]^n}{[\lambda N_S + 1]^{n+1}} |n\rangle\langle n|^{A'}. \tag{24.172}
\end{align}

The reduced state $\theta$ is known as a thermal state with mean photon number $\lambda N_S$. We can
readily check that its mean photon number is $\lambda N_S$ simply by computing the expectation of
the photon number $n$ with respect to the geometric distribution $[\lambda N_S]^n/[\lambda N_S + 1]^{n+1}$:

\begin{align}
\sum_{n=0}^{\infty} n \frac{[\lambda N_S]^n}{[\lambda N_S + 1]^{n+1}} = \lambda N_S. \tag{24.173}
\end{align}

The most important property of the displacement operators $D(\alpha)$ for our purposes is that
averaging over a random choice of them according to the Gaussian distribution $p_{(1-\lambda)N_S}(\alpha)$,
where each operator acts on the state $\theta$, gives a thermal state $\overline{\theta}$ with mean photon number $N_S$:

\begin{align}
\overline{\theta} &\equiv \int d\alpha\; p_{(1-\lambda)N_S}(\alpha)\, D(\alpha)\theta D^\dagger(\alpha) \tag{24.174}\\
&= \sum_{n=0}^{\infty} \frac{[N_S]^n}{[N_S + 1]^{n+1}} |n\rangle\langle n|^{A'}. \tag{24.175}
\end{align}

Thus, the choice of ensemble in (24.168) meets the constraint that the average number of
photons input to the channel be equal to $N_S$.

In order to calculate the quantum dynamic capacity region for this lossy bosonic channel,
it is helpful to observe that the entropy of a thermal state with mean number of photons $N_S$
is equal to

\begin{align}
g(N_S) \equiv (N_S + 1)\log_2(N_S + 1) - N_S \log_2(N_S), \tag{24.176}
\end{align}
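
The claims about the thermal state can be checked numerically: its photon-number distribution is geometric, its mean is the stated mean photon number, and its entropy equals $g$. The sketch below truncates the infinite sum at a finite cutoff, which is an approximation; the function names and the cutoff value are assumptions made for illustration.

```python
import math


def g(n_mean):
    """Entropy in bits of a thermal state with mean photon number n_mean, as in (24.176)."""
    if n_mean <= 0.0:
        return 0.0
    return (n_mean + 1.0) * math.log2(n_mean + 1.0) - n_mean * math.log2(n_mean)


def thermal_distribution(n_mean, cutoff):
    """First `cutoff` terms of p_n = n_mean^n / (n_mean + 1)^(n + 1).

    Built iteratively to avoid overflow in the powers.
    """
    ratio = n_mean / (n_mean + 1.0)
    p = 1.0 / (n_mean + 1.0)
    probs = []
    for _ in range(cutoff):
        probs.append(p)
        p *= ratio
    return probs


def mean_photon_number(probs):
    """Expectation of the photon number n under the given distribution."""
    return sum(n * p for n, p in enumerate(probs))


def entropy_bits(probs):
    """Shannon entropy in bits, skipping zero-probability terms."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    probs = thermal_distribution(2.0, cutoff=400)
    print(mean_photon_number(probs), entropy_bits(probs), g(2.0))
```

Because the state is diagonal in the photon-number basis, its von Neumann entropy coincides with the Shannon entropy of the geometric distribution, which is why a classical entropy computation suffices here.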



because we will evaluate all of the relevant entropies on thermal states. From Exercise 24.2.1
we know that we should evaluate just the following four entropies:

\begin{align}
H(A|X)_\sigma &= \int d\alpha\; p_{(1-\lambda)N_S}(\alpha)\, H(D(\alpha)\theta D^\dagger(\alpha)), \tag{24.177}\\
H(B)_\sigma &= H(\mathcal{N}(\overline{\theta})), \tag{24.178}\\
H(B|X)_\sigma &= \int d\alpha\; p_{(1-\lambda)N_S}(\alpha)\, H(\mathcal{N}(D(\alpha)\theta D^\dagger(\alpha))), \tag{24.179}\\
H(E|X)_\sigma &= \int d\alpha\; p_{(1-\lambda)N_S}(\alpha)\, H(\mathcal{N}^c(D(\alpha)\theta D^\dagger(\alpha))), \tag{24.180}
\end{align}

where $\mathcal{N}$ is the lossy bosonic channel that transmits $\eta$ of the input photons to the receiver
and $\mathcal{N}^c$ is the complementary channel that transmits $1 - \eta$ of the input photons to the
environment Eve. We proceed with calculating the above four entropies:

\begin{align}
\int d\alpha\; p_{(1-\lambda)N_S}(\alpha)\, H(D(\alpha)\theta D^\dagger(\alpha)) &= \int d\alpha\; p_{(1-\lambda)N_S}(\alpha)\, H(\theta) \tag{24.181}\\
&= H(\theta) \tag{24.182}\\
&= g(\lambda N_S). \tag{24.183}
\end{align}

The first equality follows because $D(\alpha)$ is a unitary operator, and the third equality follows
because $\theta$ is a thermal state with mean photon number $\lambda N_S$. Continuing, we have

\begin{align}
H(\mathcal{N}(\overline{\theta})) = g(\eta N_S), \tag{24.184}
\end{align}

because $\overline{\theta}$ is a thermal state with mean photon number $N_S$, but the channel only lets a
fraction $\eta$ of the input photons through on average. The third entropy in (24.179) is equal
to

\begin{align}
\int d\alpha\; p_{(1-\lambda)N_S}(\alpha)\, H(\mathcal{N}(D(\alpha)\theta D^\dagger(\alpha)))
&= \int d\alpha\; p_{(1-\lambda)N_S}(\alpha)\, H(D(\alpha)\mathcal{N}(\theta)D^\dagger(\alpha)) \tag{24.185}\\
&= \int d\alpha\; p_{(1-\lambda)N_S}(\alpha)\, H(\mathcal{N}(\theta)) \tag{24.186}\\
&= H(\mathcal{N}(\theta)) \tag{24.187}\\
&= g(\lambda\eta N_S). \tag{24.188}
\end{align}

The first equality follows because a displacement operator commutes with the channel (we
do not justify this rigorously here). The second equality follows because $D(\alpha)$ is a unitary
operator. The final equality follows because $\theta$ is a thermal state with mean photon number
$\lambda N_S$, but the channel only lets a fraction $\eta$ of the input photons through on average. By
the same line of reasoning (except that the complementary channel lets through only a
fraction $1 - \eta$ of the input photons), the fourth entropy in (24.180) is equal to

\begin{align}
\int d\alpha\; p_{(1-\lambda)N_S}(\alpha)\, H(\mathcal{N}^c(D(\alpha)\theta D^\dagger(\alpha)))
&= \int d\alpha\; p_{(1-\lambda)N_S}(\alpha)\, H(D(\alpha)\mathcal{N}^c(\theta)D^\dagger(\alpha)) \tag{24.189}\\
&= \int d\alpha\; p_{(1-\lambda)N_S}(\alpha)\, H(\mathcal{N}^c(\theta)) \tag{24.190}\\
&= H(\mathcal{N}^c(\theta)) \tag{24.191}\\
&= g(\lambda(1-\eta)N_S). \tag{24.192}
\end{align}

Then, by the result of Exercise 24.2.1 and a matching converse that holds whenever $\eta \geq 1/2$ (see footnote 2),
we have the following characterization of the quantum dynamic capacity region of the lossy
bosonic channel.

Theorem 24.5.3. The quantum dynamic capacity region for a lossy bosonic channel with
transmissivity $\eta \geq 1/2$ is the union of regions of the form:

\begin{align}
C + 2Q &\leq g(\lambda N_S) + g(\eta N_S) - g((1-\eta)\lambda N_S), \tag{24.193}\\
Q + E &\leq g(\eta\lambda N_S) - g((1-\eta)\lambda N_S), \tag{24.194}\\
C + Q + E &\leq g(\eta N_S) - g((1-\eta)\lambda N_S), \tag{24.195}
\end{align}

where $\lambda \in [0, 1]$ is a photon-number-sharing parameter and $g(N)$ is the entropy of a thermal
state with mean photon number $N$ defined in (24.176). The region is still achievable if
$\eta < 1/2$.



Figure 24.5 depicts two important special cases of the region in the above theorem: (a) 
the trade-off between classical and quantum communication without entanglement assistance 
and (b) the trade-off between entanglement-assisted and unassisted classical communication. 
The figure indicates the remarkable improvement over time-sharing that trade-off coding 
gives. 

Other special cases of the above capacity region are the unassisted classical capacity
$g(\eta N_S)$ when $\lambda, Q, E = 0$, the quantum capacity $g(\eta N_S) - g((1-\eta)N_S)$ when $\lambda = 1$, $C, E = 0$,
and the entanglement-assisted classical capacity $g(N_S) + g(\eta N_S) - g((1-\eta)N_S)$ when
$\lambda = 1$, $Q = 0$, and $E = -\infty$.
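
The three bounds of the lossy bosonic region and the special cases just listed can be evaluated directly from the thermal-state entropy $g$. The following is a minimal sketch; the function name `lossy_bosonic_bounds` and the example values $\eta = 3/4$, $N_S = 200$ (taken from the figure discussion) are illustrative choices, not a reference implementation.

```python
import math


def g(n_mean):
    """Entropy in bits of a thermal state with mean photon number n_mean."""
    if n_mean <= 0.0:
        return 0.0
    return (n_mean + 1.0) * math.log2(n_mean + 1.0) - n_mean * math.log2(n_mean)


def lossy_bosonic_bounds(lam, eta, n_s):
    """Bounds on (C + 2Q, Q + E, C + Q + E) for photon-sharing parameter lam.

    lam is the fraction of photons given to the quantum part of the code,
    eta is the transmissivity, and n_s is the mean photon number at the sender.
    """
    env = g((1.0 - eta) * lam * n_s)           # entropy leaked to the environment
    return (g(lam * n_s) + g(eta * n_s) - env,  # bound on C + 2Q
            g(eta * lam * n_s) - env,           # bound on Q + E
            g(eta * n_s) - env)                 # bound on C + Q + E


if __name__ == "__main__":
    eta, n_s = 0.75, 200.0
    classical = lossy_bosonic_bounds(0.0, eta, n_s)[2]   # lam = 0, Q = E = 0
    quantum = lossy_bosonic_bounds(1.0, eta, n_s)[1]     # lam = 1, C = E = 0
    assisted = lossy_bosonic_bounds(1.0, eta, n_s)[0]    # lam = 1, Q = 0, E unbounded below
    print(classical, quantum, assisted)
```

For these example values the quantum capacity comes out close to $\log_2 3 \approx 1.58$ qubits per channel use and the entanglement-assisted classical capacity close to 10.7 bits per channel use, in line with the numbers quoted in the caption of Figure 24.5.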

24.6 History and Further Reading 

Shor considered the classical capacity of a channel assisted by a finite amount of shared en- 
tanglement [229]. He calculated a trade-off curve that determines how a sender can optimally 



2 We should clarify that the converse holds only if a long-standing minimum-output entropy conjecture is
true (researchers have collected much evidence that it should be true).

[Figure 24.5 panels: (a) classical-quantum trade-off for the lossy bosonic channel with η = 3/4; (b) classical-entanglement trade-off for the lossy bosonic channel with η = 3/4.]

Figure 24.5: (a) Suppose a channel transmits on average 3/4 of the photons to the receiver, while losing
the other 1/4 en route. Such a channel can reliably transmit a maximum of $\log_2(3/4) - \log_2(1/4) \approx 1.58$
qubits per channel use, and a mean photon budget of about 200 photons per channel use at the transmitter 
is sufficient to nearly achieve this quantum capacity. A trade-off coding strategy which lowers the quantum 
data rate to about 1.4 qubits per channel use while retaining the same mean photon budget allows for a 
sender to reliably transmit an additional 4.5 classical bits per channel use, while time-sharing would only 
allow for an additional 1 classical bit per channel use with this photon budget. The 6.5 dB increase in the 
classical data rate that trade-off coding gives over time-sharing for this example is strong enough to demand 
that quantum communication engineers employ trade-off coding strategies in order to take advantage of such 
theoretical performance gains. (b) The sender and the receiver share entanglement, and the sender would like
to transmit classical information while minimizing the consumption of entanglement. With a mean photon 
budget of 200 photons per channel use over a channel that propagates only 3/4 of the photons input to it, 
the sender can reliably transmit a maximum of about 10.7 classical bits per channel use while consuming 
entanglement at a rate of about 9.1 entangled bits per channel use. With trade-off coding, the sender can 
significantly reduce the entanglement consumption rate to about 5 entangled bits per channel use while 
still transmitting about 10.5 classical bits per channel use, only a 0.08 dB decrease in the rate of classical 
communication for a 2.6 dB decrease in the entanglement consumption rate. The savings in entanglement 
consumption could be useful for them if they would like to have the extra entanglement for future rounds of
assisted communication.

trade the consumption of noiseless entanglement with the generation of noiseless classical
communication. This trade-off curve also bounds a rate region consisting of rates of
entanglement consumption and generated classical communication. Shor's result then inspired
Devetak and Shor to consider a scenario where a sender exploits a noisy quantum channel
to simultaneously transmit both noiseless classical and quantum information [73], a scenario
later dubbed "classically-enhanced quantum coding" [159, 160] after schemes formulated in
the theory of quantum error correction [178, 249]. Devetak and Shor provided a multi-letter
characterization of the classically-enhanced quantum capacity region for general channels,
but they were able to show that both generalized dephasing channels and erasure channels
admit single-letter capacity regions.

The above scenarios are a part of the dynamic, double-resource quantum Shannon theory, 
where a sender can exploit a noisy quantum channel to generate two noiseless resources, or 
a sender can exploit a noisy quantum channel in addition to a noiseless resource to generate 
another noiseless resource. This theory culminated with the work of Devetak et al. that 
provided a multi-letter characterization for virtually every permutation of two resources and 
a noisy quantum channel which one can consider [TOl ITT] . Other researchers concurrently 
considered how noiseless resources might trade off against each other in tasks outside of the 
dynamic, double-resource quantum Shannon theory, such as quantum compression [HI 1133]
[176], remote state preparation [32] |4], and hybrid quantum memories [179].

Refs. [159, 160, 253] considered the dynamic, triple-resource quantum Shannon theory
by providing a multi-letter characterization of an entanglement-assisted quantum channel's
ability to transmit both classical and quantum information. Ref. [159] also constructed a new
protocol, dubbed the "classically-enhanced father protocol," that outperforms a time-sharing
strategy for transmitting both classical and quantum information over an
entanglement-assisted quantum channel. Brádler et al. showed that the quantum Hadamard channels
have a single-letter capacity region (JBJ. Later studies continued these efforts of exploring
information trade-offs [164, 252].

Ref. [250j recently found the quantum dynamic capacity region of the lossy bosonic 
channel (up to a long-standing minimum-output entropy conjecture). The results there 
build on a tremendous body of literature for bosonic channels. Giovannetti et al. found 
the classical capacity of the lossy bosonic channel [104] . Others found the entanglement- 
assisted classical and quantum capacities of the lossy bosonic channel [3U 1141] 11071 1106] and 
its quantum capacity [263] 1120"] . The long-standing minimum-output entropy conjecture is 
detailed in Refs. [Ml CLU [HE]. 
Summary and Outlook 



This brief final chapter gives a compact summary of the results presented in this 
book, highlights information processing tasks that we did not cover, and discusses new 
directions. We exploit the resource inequality formalism in our summary. 
A resource inequality is a statement of achievability: 

$$\sum_k \alpha_k \geq \sum_j \beta_j, \tag{25.1}$$

meaning that the resources $\{\alpha_k\}$ on the LHS can simulate the resources $\{\beta_j\}$ on the RHS. 
The simulation can be exact and finite or asymptotically perfect. We can classify resources 
as follows: 

1. Unit, noiseless, or noisy. 

2. Dynamic or static. Moreover, dynamic resources can be relative (see below). 

3. Classical, quantum, or hybrid. 

The unit resources are as follows: $[c \to c]$ represents one noiseless classical bit channel, 
$[q \to q]$ represents one noiseless qubit channel, $[qq]$ represents one noiseless ebit, and $[q \to qq]$ 
represents one noiseless coherent bit channel. We also have $[c \to c]_{\text{priv}}$ as a noiseless private 
classical bit channel and $[cc]_{\text{priv}}$ as a noiseless bit of secret key. An example of a noiseless 
resource is a pure bipartite state $|\phi\rangle^{AB}$ shared between Alice and Bob or an identity channel 
$I^{A \to B}$ from Alice to Bob. An example of a noisy resource could be a mixed bipartite state 
$\rho^{AB}$ or a noisy channel $\mathcal{N}^{A' \to B}$. Unit resources are a special case of noiseless resources, which 
are in turn a special case of noisy resources. 

A shared state $\rho^{AB}$ is an example of a noisy static resource, and a channel $\mathcal{N}$ is an example 
of a noisy dynamic resource. We indicate these by $\langle \rho \rangle$ or $\langle \mathcal{N} \rangle$ in a resource inequality. We 
can be more precise if necessary and write $\langle \mathcal{N} \rangle$ as a dynamic, relative resource $\langle \mathcal{N}^{A' \to B} : \sigma^{A'} \rangle$, 
meaning that the protocol only works as it should if the state input to the channel is $\sigma^{A'}$. 
It is obvious when a resource is classical or when it is quantum, and an example of a 
hybrid resource is a classical-quantum state 

$$\rho^{XA} = \sum_x p_X(x) |x\rangle\langle x|^X \otimes \rho_x^A. \tag{25.2}$$



25.1 Unit Protocols 



Chapter 6 discussed entanglement distribution 

$$[q \to q] \geq [qq], \tag{25.3}$$

teleportation 

$$2[c \to c] + [qq] \geq [q \to q], \tag{25.4}$$

and super-dense coding 

$$[q \to q] + [qq] \geq 2[c \to c]. \tag{25.5}$$

Chapter 7 introduced coherent dense coding 

$$[q \to q] + [qq] \geq 2[q \to qq], \tag{25.6}$$

and coherent teleportation 

$$2[q \to qq] \geq [q \to q] + [qq]. \tag{25.7}$$

The fact that these two resource inequalities are dual under resource reversal implies the 
important coherent communication identity: 

$$2[q \to qq] = [q \to q] + [qq]. \tag{25.8}$$

We also have the following resource inequalities: 

$$[q \to q] \geq [q \to qq] \geq [qq]. \tag{25.9}$$

Other unit protocols not covered in this book are the one-time pad 

$$[c \to c]_{\text{pub}} + [cc]_{\text{priv}} \geq [c \to c]_{\text{priv}}, \tag{25.10}$$

secret key distribution 

$$[c \to c]_{\text{priv}} \geq [cc]_{\text{priv}}, \tag{25.11}$$

and private-to-public transmission 

$$[c \to c]_{\text{priv}} \geq [c \to c]_{\text{pub}}. \tag{25.12}$$

The last protocol assumes a model where the receiver can locally copy information and place 
it in a register to which Eve has access. 
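
The bookkeeping behind these unit protocols can be sketched in a few lines of Python. This is only an illustrative ledger, not part of the formalism: the string labels and the helper `apply` are hypothetical names, and a "pool" simply records how many units of each resource are on hand. The sketch shows that coherent dense coding followed by coherent teleportation returns the starting resources, as the coherent communication identity (25.8) suggests, whereas super-dense coding followed by teleportation consumes two ebits for no net gain.

```python
from fractions import Fraction

# Unit resources are labelled by plain strings (labels are illustrative).
CBIT, QUBIT, EBIT, COBIT = "[c->c]", "[q->q]", "[qq]", "[q->qq]"

# Each protocol is a pair (consumed, produced), copied from (25.4)-(25.7).
TELEPORTATION = ({CBIT: 2, EBIT: 1}, {QUBIT: 1})
SUPER_DENSE_CODING = ({QUBIT: 1, EBIT: 1}, {CBIT: 2})
COHERENT_DENSE_CODING = ({QUBIT: 1, EBIT: 1}, {COBIT: 2})
COHERENT_TELEPORTATION = ({COBIT: 2}, {QUBIT: 1, EBIT: 1})


def apply(protocol, pool):
    """Consume the LHS of a resource inequality from pool and add its RHS."""
    consumed, produced = protocol
    out = {name: Fraction(amount) for name, amount in pool.items()}
    for name, amount in consumed.items():
        if out.get(name, 0) < amount:
            raise ValueError(f"not enough {name} to run the protocol")
        out[name] -= amount
    for name, amount in produced.items():
        out[name] = out.get(name, Fraction(0)) + amount
    # Drop exhausted resources so that pools compare cleanly.
    return {name: amount for name, amount in out.items() if amount != 0}


start = {QUBIT: 1, EBIT: 1}
# Coherent dense coding followed by coherent teleportation is lossless.
round_trip = apply(COHERENT_TELEPORTATION, apply(COHERENT_DENSE_CODING, start))
# The incoherent round trip needs a second ebit and does not return the ebits.
lossy = apply(TELEPORTATION, apply(SUPER_DENSE_CODING, {QUBIT: 1, EBIT: 2}))
print(round_trip, lossy)
```

Exact `Fraction` arithmetic is used so that the same ledger also works for fractional rates such as the factors of one half that appear in later resource inequalities.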
25.2 Noiseless Quantum Shannon Theory 

Noiseless quantum Shannon theory consists of resource inequalities involving unit resources 
and one non-unit, noiseless resource, such as an identity channel or a pure bipartite state. 

Schumacher compression from Chapter 17 gives a way to simulate an identity channel 
$I^{A \to B}$ acting on a mixed state $\rho^A$ by exploiting noiseless qubit channels at a rate equal to 
the entropy $H(A)_\rho$: 

$$H(A)_\rho [q \to q] \geq \langle I^{A \to B} : \rho^A \rangle. \tag{25.13}$$

We also know that if $n$ uses of an identity channel are available, then achievability of the 
coherent information for quantum communication (Chapter 23) implies that we can send 
quantum data down this channel at a rate equal to $H(B) - H(E)$, where the entropies are 
with respect to some input density operator $\rho^A$. But $H(E) = 0$ because the channel is the 
identity channel (the environment gets no information) and $H(B) = H(A)$ because Alice's 
input goes directly to Bob. This gives us the following resource inequality: 

$$\langle I^{A \to B} : \rho^A \rangle \geq H(A)_\rho [q \to q], \tag{25.14}$$

and combining (25.13) and (25.14) gives the following resource equality: 

$$\langle I^{A \to B} : \rho^A \rangle = H(A)_\rho [q \to q]. \tag{25.15}$$

Entanglement concentration from Chapter 18 converts many copies of a pure, bipartite 
state $|\phi\rangle^{AB}$ into ebits at a rate equal to the entropy of entanglement: 

$$\langle \phi^{AB} \rangle \geq H(A)_\phi [qq]. \tag{25.16}$$

We did not discuss entanglement dilution in any detail [188, 187, 124, 135], but it is a protocol 
that exploits a sublinear amount of classical communication to dilute ebits into $n$ copies of 
a pure, bipartite state $|\phi\rangle^{AB}$. Ignoring the sublinear rate of classical communication gives 
the following resource inequality: 

$$H(A)_\phi [qq] \geq \langle \phi^{AB} \rangle. \tag{25.17}$$

Combining entanglement concentration and entanglement dilution gives the following 
resource equality: 

$$\langle \phi^{AB} \rangle = H(A)_\phi [qq]. \tag{25.18}$$

The noiseless quantum Shannon theory is satisfactory in the sense that we can obtain re- 
source equalities, illustrating the interconvertibility of noiseless qubit channels with a relative 
identity channel and pure, bipartite states with ebits. 
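
To make the rates in this section concrete, the following sketch evaluates $H(A)$ for two simple families of qubit examples. The examples are assumptions chosen for illustration: a qubit source described by the length of its Bloch vector (for Schumacher compression) and a two-qubit pure state $\sqrt{p}|00\rangle + \sqrt{1-p}|11\rangle$ (for entanglement concentration and dilution). The function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Shannon entropy in bits of the distribution (p, 1 - p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def schumacher_rate(bloch_length):
    """H(A) for a qubit state whose Bloch vector has the given length.

    The eigenvalues of such a state are (1 + r) / 2 and (1 - r) / 2.
    """
    return binary_entropy((1 + bloch_length) / 2)


def ebits_per_copy(schmidt_probability):
    """Entropy of entanglement of sqrt(p)|00> + sqrt(1 - p)|11>."""
    return binary_entropy(schmidt_probability)


print(schumacher_rate(0.0))   # maximally mixed qubit
print(schumacher_rate(1.0))   # pure qubit
print(ebits_per_copy(0.5))    # maximally entangled pair
print(ebits_per_copy(0.25))   # partially entangled pair
```

A maximally mixed qubit needs one noiseless qubit channel per copy, a pure qubit needs none, and a partially entangled pair yields strictly less than one ebit per copy.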
25.3 Noisy Quantum Shannon Theory 

Noisy quantum Shannon theory has resource inequalities with one noisy resource, such as a 
noisy channel or a noisy state, interacting with other unit resources. We can further classify 
a resource inequality as dynamic or static, depending on whether the noisy resource involved 
is dynamic or static. 

We first review the dynamic resource inequalities presented in this book. These protocols 
involve a noisy channel interacting with the other unit resources. Many of the protocols in 
noisy quantum Shannon theory generate random codes from a state of the following form: 

$$\rho^{XABE} \equiv \sum_x p_X(x) |x\rangle\langle x|^X \otimes U_{\mathcal{N}}^{A' \to BE}(\phi_x^{AA'}), \tag{25.19}$$

where $\phi_x^{AA'}$ is a pure, bipartite state and $U_{\mathcal{N}}^{A' \to BE}$ is an isometric extension of a channel 
$\mathcal{N}^{A' \to B}$. Also important is a special case of the above form: 

$$\sigma^{ABE} \equiv U_{\mathcal{N}}^{A' \to BE}(\phi^{AA'}), \tag{25.20}$$

where $\phi^{AA'}$ is a pure, bipartite state. Holevo-Schumacher-Westmoreland coding for classical 
communication over a quantum channel (Chapter 19) is the following resource inequality: 

$$\langle \mathcal{N} \rangle \geq I(X;B)_\rho [c \to c]. \tag{25.21}$$

Devetak-Cai-Winter-Yeung coding for private classical communication over a quantum 
channel (Chapter 22) is as follows: 

$$\langle \mathcal{N} \rangle \geq (I(X;B)_\rho - I(X;E)_\rho) [c \to c]_{\text{priv}}. \tag{25.22}$$

Upgrading the private classical code to one that operates coherently gives Devetak's method 
for coherent communication over a quantum channel (Chapter 23): 

$$\langle \mathcal{N} \rangle \geq I(A\rangle B)_\sigma [q \to qq], \tag{25.23}$$

which we showed can be converted asymptotically into a protocol for quantum communication: 

$$\langle \mathcal{N} \rangle \geq I(A\rangle B)_\sigma [q \to q]. \tag{25.24}$$

Bennett-Shor-Smolin-Thapliyal coding for entanglement-assisted classical communication 
over a quantum channel (Chapter 20) is the following resource inequality: 

$$\langle \mathcal{N} \rangle + H(A)_\sigma [qq] \geq I(A;B)_\sigma [c \to c]. \tag{25.25}$$

We showed how to upgrade this protocol to one for entanglement-assisted coherent 
communication (Chapter 21): 

$$\langle \mathcal{N} \rangle + H(A)_\sigma [qq] \geq I(A;B)_\sigma [q \to qq], \tag{25.26}$$
and combining with the coherent communication identity gives the following protocol for 
entanglement-assisted quantum communication: 

$$\langle \mathcal{N} \rangle + \frac{1}{2} I(A;E)_\sigma [qq] \geq \frac{1}{2} I(A;B)_\sigma [q \to q]. \tag{25.27}$$

Further combining with entanglement distribution gives the resource inequality in (25.24) 
for quantum communication. By combining the HSW and BSST protocols together (this 
needs to be done at the level of coding and not at the level of resource inequalities; see 
Chapter 21), we recover a protocol for entanglement-assisted communication of classical and 
quantum information: 

$$\langle \mathcal{N} \rangle + \frac{1}{2} I(A;E|X)_\sigma [qq] \geq \frac{1}{2} I(A;B|X)_\sigma [q \to q] + I(X;B)_\sigma [c \to c]. \tag{25.28}$$

This protocol recovers any protocol in dynamic quantum Shannon theory that involves a 
noisy channel and the three unit resources after combining it with the three unit protocols 
in (25.3)-(25.5). Important special cases are entanglement-assisted classical communication 
with limited entanglement: 

$$\langle \mathcal{N} \rangle + H(A|X)_\sigma [qq] \geq I(AX;B)_\sigma [c \to c], \tag{25.29}$$

and simultaneous classical and quantum communication: 

$$\langle \mathcal{N} \rangle \geq I(X;B)_\sigma [c \to c] + I(A\rangle BX)_\sigma [q \to q]. \tag{25.30}$$
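
As a numerical sanity check on how the entanglement-assisted rates combine with entanglement distribution, the sketch below evaluates the entropic quantities for one concrete channel. The choice of channel is an assumption made purely for illustration: a qubit erasure channel with erasure probability `eps`, fed with half of a maximally entangled pair, for which the entropies of $\sigma^{ABE}$ follow from its block structure. The function names are hypothetical.

```python
import math


def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def erasure_entropies(eps):
    """Entropies of sigma^{ABE} for a qubit erasure channel (assumed example).

    Alice feeds half of a maximally entangled qubit pair into a channel that
    delivers the qubit with probability 1 - eps and an erasure flag otherwise.
    """
    H_A = 1.0
    H_B = h2(eps) + (1 - eps) * 1.0   # flag entropy + mixed qubit when delivered
    H_AB = h2(eps) + eps * 1.0        # flag entropy + mixed A when erased
    H_E = H_AB                        # sigma^{ABE} is pure
    H_AE = H_B                        # sigma^{ABE} is pure
    return H_A, H_B, H_E, H_AB, H_AE


def rates(eps):
    H_A, H_B, H_E, H_AB, H_AE = erasure_entropies(eps)
    return {
        "I(A;B)": H_A + H_B - H_AB,
        "I(A;E)": H_A + H_E - H_AE,
        "I(A>B)": H_B - H_AB,
    }


r = rates(0.2)
# Entanglement-assisted quantum communication: I(A;E)/2 ebits in, I(A;B)/2 qubits out.
ebits_in, qubits_out = r["I(A;E)"] / 2, r["I(A;B)"] / 2
# Paying for the ebits with entanglement distribution leaves the coherent information.
net_qubits = qubits_out - ebits_in
print(ebits_in, qubits_out, net_qubits, r["I(A>B)"])
```

The net rate agrees with the coherent information because $I(A;B) - I(A;E) = 2 I(A\rangle B)$ whenever the state on $ABE$ is pure.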



Chapter 21 touched on some important protocols in static quantum Shannon theory. 
These protocols involve some noisy state $\rho^{AB}$ interacting with the unit resources. The 
protocol for coherent-assisted state transfer is the static counterpart to the protocol in (25.26): 

$$\langle W^{S \to AB} : \rho^S \rangle + H(A)_\rho [q \to q] \geq I(A;B)_\rho [q \to qq] + \langle I^{S \to \hat{B}B} : \rho^S \rangle, \tag{25.31}$$

where $W$ is some isometry that distributes the state from a source $S$ to two parties $A$ and 
$B$ and $I^{S \to \hat{B}B}$ is the identity. Ignoring the source and state transfer in the above protocol 
gives a protocol for quantum-assisted coherent communication: 

$$\langle \rho \rangle + H(A)_\rho [q \to q] \geq I(A;B)_\rho [q \to qq]. \tag{25.32}$$

We can also combine (25.31) with the unit protocols to obtain quantum-assisted state transfer: 

$$\langle W^{S \to AB} : \rho^S \rangle + \frac{1}{2} I(A;R)_\varphi [q \to q] \geq \frac{1}{2} I(A;B)_\varphi [qq] + \langle I^{S \to \hat{B}B} : \rho^S \rangle, \tag{25.33}$$

and classical-assisted state transfer: 

$$\langle W^{S \to AB} : \rho^S \rangle + I(A;R)_\varphi [c \to c] \geq I(A\rangle B)_\varphi [qq] + \langle I^{S \to \hat{B}B} : \rho^S \rangle, \tag{25.34}$$

where $|\varphi\rangle^{ABR}$ is a purification of $\rho^{AB}$. We also have noisy super-dense coding 

$$\langle \rho \rangle + H(A)_\rho [q \to q] \geq I(A;B)_\rho [c \to c], \tag{25.35}$$

and noisy teleportation 

$$\langle \rho \rangle + I(A;B)_\rho [c \to c] \geq I(A\rangle B)_\rho [q \to q], \tag{25.36}$$

by combining (25.32) with the coherent communication identity and the unit protocols. 
25.4 Protocols not covered in this book 

There are many important protocols that we did not cover in this book because our focus 
here was on channels. One such example is quantum state redistribution. Suppose that Alice 
and Bob share many copies of a tripartite state $\rho^{ACB}$ where Alice has the shares $AC$ and 
Bob has the share $B$. The goal of state redistribution is for Alice to transfer the $C$ part of 
the state to Bob using the minimal resources needed to do so. It is useful to identify a pure 
state $\varphi^{RACB}$ as a purification of $\rho^{ACB}$, where $R$ is the purifying system. Devetak and Yard 
showed the existence of the following state redistribution protocol [77, 267]: 

$$\langle W^{S \to AC|B} : \rho^S \rangle + \frac{1}{2} I(C;RB)_\varphi [q \to q] + \frac{1}{2} I(C;A)_\varphi [qq] \geq \langle W^{S \to A|CB} : \rho^S \rangle + \frac{1}{2} I(C;B)_\varphi [q \to q] + \frac{1}{2} I(C;B)_\varphi [qq], \tag{25.37}$$

where $W^{S \to AC|B}$ is some isometry that distributes the system $S$ as $AC$ for Alice and $B$ for Bob 
and $W^{S \to A|CB}$ is defined similarly. They also demonstrated that the above resource inequality 
gives an optimal cost pair for the quantum communication rate $Q$ and the entanglement 
consumption rate $E$, with 

$$Q = \frac{1}{2} I(C;R|B)_\varphi, \tag{25.38}$$

$$E = \frac{1}{2} \left[ I(C;A)_\varphi - I(C;B)_\varphi \right]. \tag{25.39}$$

Thus, their protocol gives a direct operational interpretation to the conditional quantum 
mutual information $\frac{1}{2} I(C;R|B)$ as the net rate of quantum communication required in 
quantum state redistribution. 
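
For readers who want to see where the cost pair comes from, the net rates can be read off from the state redistribution resource inequality by moving the resources generated on the right-hand side over to the left. The only ingredient beyond that inequality is the chain rule for quantum mutual information, $I(C;RB) = I(C;B) + I(C;R|B)$; the short derivation below is a sketch under that standard identity.

```latex
\begin{align*}
Q &= \tfrac{1}{2} I(C;RB)_\varphi - \tfrac{1}{2} I(C;B)_\varphi \\
  &= \tfrac{1}{2} \left[ I(C;B)_\varphi + I(C;R|B)_\varphi \right] - \tfrac{1}{2} I(C;B)_\varphi \\
  &= \tfrac{1}{2} I(C;R|B)_\varphi, \\
E &= \tfrac{1}{2} I(C;A)_\varphi - \tfrac{1}{2} I(C;B)_\varphi .
\end{align*}
```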

A simple version of the quantum reverse Shannon theorem gives a way to simulate the 
action of a channel $\mathcal{N}^{A' \to B}$ on some input state $\rho^{A'}$ by exploiting classical communication 
and entanglement J3H 123 I3"9"] : 

$$H(B)_\sigma [qq] + I(R;B)_\sigma [c \to c] \geq \langle \mathcal{N}^{A' \to B} : \rho^{A'} \rangle, \tag{25.40}$$

where 

$$\sigma^{RB} \equiv \mathcal{N}^{A' \to B}(\psi^{RA'}), \tag{25.41}$$

with $\psi^{RA'}$ a purification of $\rho^{A'}$. One utility of the quantum reverse Shannon theorem is that 
it gives an indication of how one channel might simulate another in the presence of shared 
entanglement. In the simulation of the channel $\mathcal{N}^{A' \to B}$, the environment is also simulated 
and ends up in Alice's possession. Thus, they end up simulating the quantum feedback 
channel $U_{\mathcal{N}}^{A' \to AB}$, and we can restate (25.40) as follows: 

$$H(B)_\sigma [qq] + I(R;B)_\sigma [c \to c] \geq \langle U_{\mathcal{N}}^{A' \to AB} : \rho^{A'} \rangle. \tag{25.42}$$

It is possible to upgrade the classical communication to coherent communication [69], leading 
to the following coherent, fully-quantum version of the quantum reverse Shannon theorem [3]: 

$$\frac{1}{2} I(A;B)_\sigma [qq] + \frac{1}{2} I(R;B)_\sigma [q \to q] \geq \langle U_{\mathcal{N}}^{A' \to AB} : \rho^{A'} \rangle. \tag{25.43}$$

Combining this resource inequality with the following one from Exercise 21.1.1 

$$\langle U_{\mathcal{N}}^{A' \to AB} : \rho^{A'} \rangle \geq \frac{1}{2} I(A;B)_\sigma [qq] + \frac{1}{2} I(R;B)_\sigma [q \to q] \tag{25.44}$$

gives the following satisfying resource equality: 

$$\langle U_{\mathcal{N}}^{A' \to AB} : \rho^{A'} \rangle = \frac{1}{2} I(A;B)_\sigma [qq] + \frac{1}{2} I(R;B)_\sigma [q \to q]. \tag{25.45}$$

The above resource equality is a generalization of the coherent communication identity. 
A more general version of the quantum reverse Shannon theorem quantifies the resources 
needed to simulate many independent instances of a quantum channel on an arbitrary input 
state, and the proof in this case is significantly more complicated [2TI I3"9"] . 

Other protocols that we did not cover are remote state preparation [28| |32| H], classical 
compression with quantum side information [74], trade-offs between public and private 
resources and channels [252], trade-offs in compression [133], and a trade-off for a noisy state 
with the three unit resources [160]. The resource inequality formalism is helpful for devising 
new protocols in quantum Shannon theory by imagining some resources to be unit and others 
to be noisy. 

25.5 Network Quantum Shannon Theory 

The field of network quantum Shannon theory has arisen in recent years, motivated by the 
idea that one day we will be dealing with a quantum Internet in which channels of increasing 
complexity can connect a number of senders to a number of receivers. A quantum multiple 
access channel has multiple senders and one receiver. Various authors have considered classical 
communication over a multiple access channel [257], quantum communication over multiple 
access channels [148, 269], and entanglement-assisted protocols [156]. A quantum broadcast 
channel has one sender and multiple receivers. Various authors have addressed similar 
scenarios in this setting [270, 84]. A quantum interference channel has multiple senders and 
multiple receivers in which certain sender-receiver pairs are interested in communicating. 
Recent progress in this direction is in Ref. [93]. One could also consider distributed 
compression tasks, and various authors have contributed to this direction [HI El I210J . We could 
imagine a future textbook containing several chapters that summarize all of the progress in 
network quantum Shannon theory and the novel techniques needed to handle coding over 
such channels. Savov highlights much of this direction in his PhD thesis [211] (at least for 
classical communication). 
25.6 Future Directions 

Quantum Shannon theory has evolved from the first and simplest result regarding Schu- 
macher compression to a whole host of protocols that indicate how much data we can trans- 
mit over noisy quantum channels or how much we can compress information of varying 
types — the central question in any task is, "How many unit resources can we extract from a 
given non-unit resource, perhaps with the help of other non-unit resources?" This book may 
give the impression that so much has been solved in the area of quantum Shannon theory 
that little remains for the future, but this is actually far from the truth! There remains much 
to do to improve our understanding, and this final section briefly outlines just a few of these 
important questions. 

Find a better formula for the classical capacity other than the HSW formula. Our best 
characterization of the classical capacity is with a regularized version of the HSW formula, 
and this is unsatisfying in several ways that we have mentioned before. In a similar vein, find 
a better formula for the private classical capacity, the quantum capacity, and even for the 
trade-off capacities. All of these formulas are unsatisfying because their regularizations seem 
to be necessary in the general case. It could be the case that an entropic expression evaluated 
on some finite tensor power of the channels would be sufficient to characterize the capacity for 
different tasks, but this is a difficult question to answer. Interestingly, recent work suggests 
investigating whether this question is algorithmically undecidable (see Ref. [261]). 



Effects such as superactivation of quantum capacity (see Section 23.7.2) and non-additivity 
of private capacity (see Section 22.5.2) have highlighted how little we actually know about 
the corresponding information processing tasks in the general case. Also, it is important 
to understand these effects more fully and to see if there is any way of exploiting them in 
a practical communication scheme. Finally, a different direction is to expand the number 
of channels that have additive capacities. For example, finding the quantum capacity of a 
non-degradable quantum channel would be a great result. 

Continue to explore network quantum Shannon theory. The single-sender, single-receiver 
channel setting is a useful model for study and applies to many practical scenarios, but 
eventually, we will be dealing with channels connecting many inputs to many outputs. Having 
such an understanding for information transmission in these scenarios could help guide the 
design of practical communication schemes and might even shed light on the open problems 
in the preceding paragraph. 
Miscellaneous Mathematics 



This section collects various useful definitions and lemmas that we use throughout the proofs 
of certain theorems in this book. 

Lemma A.0.1. Suppose that $M$ and $N$ are positive operators. Then the operators $M + N$, 
$MNM$, and $NMN$ are positive. 

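Lemma A.0.2. Suppose that the operators $\hat{\omega}$ and $\omega$ have trace less than or equal to unity. 
Suppose $\hat{\omega}$ lies in the operator interval $[(1 - \epsilon)\omega, (1 + \epsilon)\omega]$. Then 

$$\|\hat{\omega} - \omega\|_1 \leq \epsilon. \tag{A.1}$$

Proof. The statement "$\hat{\omega}$ lies in the operator interval $[(1 - \epsilon)\omega, (1 + \epsilon)\omega]$" is equivalent to 
the following two conditions: 

$$(1 + \epsilon)\omega - \hat{\omega} = \epsilon\omega - (\hat{\omega} - \omega) \geq 0, \tag{A.2}$$

$$\hat{\omega} - (1 - \epsilon)\omega = (\hat{\omega} - \omega) + \epsilon\omega \geq 0. \tag{A.3}$$

Let $\alpha \equiv \hat{\omega} - \omega$. Let us rewrite $\alpha$ in terms of the positive operators $\alpha^+$ and $\alpha^-$ 

$$\alpha = \alpha^+ - \alpha^- \tag{A.4}$$

as we did in the proof of Lemma 9.1.1. The above conditions become as follows: 

$$\epsilon\omega - \alpha \geq 0, \tag{A.5}$$

$$\alpha + \epsilon\omega \geq 0. \tag{A.6}$$

Let the positive projectors $\Pi^+$ and $\Pi^-$ project onto the respective supports of $\alpha^+$ and $\alpha^-$. 
We then apply the projector $\Pi^+$ to the first condition: 

$$\Pi^+ (\epsilon\omega - \alpha) \Pi^+ \geq 0 \tag{A.7}$$

$$\Rightarrow \epsilon \Pi^+ \omega \Pi^+ - \Pi^+ \alpha \Pi^+ \geq 0 \tag{A.8}$$

$$\Rightarrow \epsilon \Pi^+ \omega \Pi^+ - \alpha^+ \geq 0 \tag{A.9}$$

where the first inequality follows from Lemma A.0.1. We apply the projector $\Pi^-$ to the 
second condition: 

$$\Pi^- (\alpha + \epsilon\omega) \Pi^- \geq 0 \tag{A.10}$$

$$\Rightarrow \Pi^- \alpha \Pi^- + \epsilon \Pi^- \omega \Pi^- \geq 0 \tag{A.11}$$

$$\Rightarrow -\alpha^- + \epsilon \Pi^- \omega \Pi^- \geq 0 \tag{A.12}$$

where the first inequality again follows from Lemma A.0.1. Adding the two positive operators 
together gives another positive operator by Lemma A.0.1: 

$$\epsilon \Pi^+ \omega \Pi^+ - \alpha^+ - \alpha^- + \epsilon \Pi^- \omega \Pi^- \geq 0 \tag{A.13}$$

$$\Rightarrow \epsilon \Pi^+ \omega \Pi^+ - |\hat{\omega} - \omega| + \epsilon \Pi^- \omega \Pi^- \geq 0 \tag{A.14}$$

$$\Rightarrow \epsilon\omega - |\hat{\omega} - \omega| \geq 0 \tag{A.15}$$

Apply the trace operation to get the following inequality: 

$$\epsilon \text{Tr}\{\omega\} \geq \text{Tr}\{|\hat{\omega} - \omega|\} = \|\hat{\omega} - \omega\|_1. \tag{A.16}$$

Using the hypothesis that $\text{Tr}\{\omega\} \leq 1$ gives the desired result. □ 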
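
The statement of Lemma A.0.2 can be checked numerically on a small example. The sketch below is restricted to real symmetric $2 \times 2$ matrices so that eigenvalues have a closed form and no external library is needed; the particular matrices and the helper names are assumptions made for illustration only.

```python
import math


def eig_sym2(m):
    """Eigenvalues (min, max) of a real symmetric 2x2 matrix [[a, b], [b, d]]."""
    (a, b), (_, d) = m
    mean, radius = (a + d) / 2, math.hypot((a - d) / 2, b)
    return mean - radius, mean + radius


def combine(x, m, y, n):
    """Entrywise x*m + y*n for 2x2 matrices."""
    return [[x * m[i][j] + y * n[i][j] for j in range(2)] for i in range(2)]


def trace_norm(m):
    """Sum of absolute eigenvalues of a real symmetric 2x2 matrix."""
    lo, hi = eig_sym2(m)
    return abs(lo) + abs(hi)


omega = [[0.6, 0.2], [0.2, 0.4]]            # positive, trace one
omega_hat = [[0.61, 0.195], [0.195, 0.39]]  # a small perturbation of omega
eps = 0.1

upper_gap = combine(1 + eps, omega, -1.0, omega_hat)    # (1 + eps) omega - omega_hat
lower_gap = combine(1.0, omega_hat, -(1 - eps), omega)  # omega_hat - (1 - eps) omega
in_interval = eig_sym2(upper_gap)[0] >= 0 and eig_sym2(lower_gap)[0] >= 0
distance = trace_norm(combine(1.0, omega_hat, -1.0, omega))
print(in_interval, distance)
```

Both gap operators are positive, so the perturbed operator lies in the operator interval, and the trace distance stays below `eps` as the lemma predicts.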

Theorem A.0.1 (Polar Decomposition). Any operator $A$ admits a left polar decomposition: 

$$A = U \sqrt{A^\dagger A}, \tag{A.17}$$

and a right polar decomposition: 

$$A = \sqrt{A A^\dagger} V. \tag{A.18}$$

Proof. We give a simple proof for just the right polar decomposition by appealing to the 
singular value decomposition. Any operator $A$ admits a singular value decomposition: 

$$A = U_1 \Sigma U_2, \tag{A.19}$$

where $U_1$ and $U_2$ are unitary operators and $\Sigma$ is an operator with positive singular values. 
Then 

$$A A^\dagger = U_1 \Sigma U_2 U_2^\dagger \Sigma U_1^\dagger = U_1 \Sigma^2 U_1^\dagger, \tag{A.20}$$

and thus 

$$\sqrt{A A^\dagger} = U_1 \Sigma U_1^\dagger. \tag{A.21}$$

We can take $V = U_1 U_2$ and we obtain the right polar decomposition of $A$: 

$$\sqrt{A A^\dagger} V = U_1 \Sigma U_1^\dagger U_1 U_2 = U_1 \Sigma U_2 = A. \tag{A.22}$$

□ 
Lemma A.0.3. Let $A$ be any operator and $U$ be a unitary operator. Then 

$$|\text{Tr}\{AU\}| \leq \text{Tr}\{|A|\}, \tag{A.23}$$

with saturation of the equality when $U = V^\dagger$, where $A = |A| V$ is the right polar decomposition 
of $A$. 

Proof. It is straightforward to show equality under the scenario stated in the lemma. It 
holds that 

$$|\text{Tr}\{AU\}| = |\text{Tr}\{|A| V U\}| = \left|\text{Tr}\{|A|^{1/2} |A|^{1/2} V U\}\right|. \tag{A.24}$$

It follows from the Cauchy-Schwarz inequality for the Hilbert-Schmidt inner product that 

$$|\text{Tr}\{AU\}| \leq \sqrt{\text{Tr}\{|A|\} \, \text{Tr}\{U^\dagger V^\dagger |A| V U\}} = \text{Tr}\{|A|\}. \tag{A.25}$$

□ 

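Lemma A.0.4. Consider two collections of orthonormal states $(|\chi_j\rangle)_{j \in [N]}$ and $(|\zeta_j\rangle)_{j \in [N]}$ 
such that $\langle \chi_j | \zeta_j \rangle \geq 1 - \epsilon$ for all $j$. There exist phases $\gamma_j$ and $\delta_j$ such that 

$$\langle \hat{\chi} | \hat{\zeta} \rangle \geq 1 - \epsilon, \tag{A.26}$$

where 

$$|\hat{\chi}\rangle = \frac{1}{\sqrt{N}} \sum_{j=1}^{N} e^{i\gamma_j} |\chi_j\rangle, \tag{A.27}$$

$$|\hat{\zeta}\rangle = \frac{1}{\sqrt{N}} \sum_{j=1}^{N} e^{i\delta_j} |\zeta_j\rangle. \tag{A.28}$$

Proof. Define the Fourier transformed states 

$$|\hat{\chi}_s\rangle \equiv \frac{1}{\sqrt{N}} \sum_{j=1}^{N} e^{2\pi i j s / N} |\chi_j\rangle, \tag{A.29}$$

and similarly define $|\hat{\zeta}_s\rangle$. By Parseval's relation, it follows that 

$$\frac{1}{N} \sum_{s=1}^{N} \langle \hat{\chi}_s | \hat{\zeta}_s \rangle = \frac{1}{N} \sum_{j=1}^{N} \langle \chi_j | \zeta_j \rangle \geq 1 - \epsilon. \tag{A.30}$$

Thus, at least one value of $s$ obeys the following inequality: 

$$e^{i\theta_s} \langle \hat{\chi}_s | \hat{\zeta}_s \rangle \geq 1 - \epsilon \tag{A.31}$$

for some phase $\theta_s$. Setting $\gamma_j = 2\pi j s / N$ and $\delta_j = \gamma_j + \theta_s$ satisfies the statement of the 
lemma. □ 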
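
The Fourier argument in this proof can be traced on a toy example. In the sketch below, the first collection is the standard basis of a four-dimensional space and the second is obtained by rotating pairs of basis vectors through a small angle; both the construction and the helper names are assumptions chosen for illustration, and indices run from 0 to N - 1 rather than from 1 to N.

```python
import cmath
import math

N = 4
angle = 0.3  # rotation angle; each overlap <chi_j|zeta_j> equals cos(angle)


def basis(j):
    """chi_j: the j-th standard basis vector."""
    return [1.0 if k == j else 0.0 for k in range(N)]


def rotated(j):
    """zeta_j: rotate chi_j inside the 2-dimensional block that contains it."""
    partner = j ^ 1
    sign = 1.0 if j % 2 == 0 else -1.0
    v = [0.0] * N
    v[j], v[partner] = math.cos(angle), sign * math.sin(angle)
    return v


def inner(u, v):
    """Inner product, conjugate-linear in the first argument."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


def fourier(states, s):
    """(1/sqrt(N)) * sum_j exp(2 pi i j s / N) |state_j>."""
    out = [0j] * N
    for j, state in enumerate(states):
        phase = cmath.exp(2j * math.pi * j * s / N)
        out = [o + phase * x / math.sqrt(N) for o, x in zip(out, state)]
    return out


chi = [basis(j) for j in range(N)]
zeta = [rotated(j) for j in range(N)]
eps = 1 - math.cos(angle)

overlaps = [inner(fourier(chi, s), fourier(zeta, s)) for s in range(N)]
# Parseval: the average Fourier overlap equals the average original overlap.
parseval_gap = abs(sum(overlaps) / N - sum(inner(c, z) for c, z in zip(chi, zeta)) / N)
best = max(overlaps, key=abs)
theta = -cmath.phase(best)  # phase that makes the chosen overlap real and positive
aligned = (cmath.exp(1j * theta) * best).real
print(parseval_gap, aligned, 1 - eps)
```

Choosing the value of `s` with the largest overlap and absorbing its phase into the second collection gives a superposed pair whose overlap is at least $1 - \epsilon$.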
A.1 The Operator Chernoff Bound 

In this section, we provide the proof of Ahlswede and Winter's Operator Chernoff Bound 
from Ref. [7]. Recall that we write $A \geq B$ if $A - B$ is a positive semidefinite operator and 
we write $A \ngeq B$ otherwise. 

Lemma A.1.1 (Operator Chernoff Bound). Let $\xi_1, \ldots, \xi_M$ be $M$ independent and identically 
distributed random variables with values in the algebra $\mathcal{B}(\mathcal{H})$ of bounded linear operators on 
some Hilbert space $\mathcal{H}$. Each $\xi_m$ has all of its eigenvalues between the null operator $0$ and the 
identity operator $I$: 

$$\forall m \in [M] : 0 \leq \xi_m \leq I. \tag{A.32}$$

Let $\bar{\xi}$ denote the sample average of the $M$ random variables: 

$$\bar{\xi} = \frac{1}{M} \sum_{m=1}^{M} \xi_m. \tag{A.33}$$

Suppose that the expectation $\mathbb{E}_\xi\{\xi_m\} \equiv \mu$ of each operator $\xi_m$ exceeds the identity operator 
scaled by a number $a \in (0, 1)$: 

$$\mu \geq a I. \tag{A.34}$$

Then for every $\eta$ where $0 < \eta < 1/2$ and $(1 + \eta)a \leq 1$, we can bound the probability that the 
sample average $\bar{\xi}$ lies inside the operator interval $[(1 \pm \eta)\mu]$: 

$$\Pr\{(1 - \eta)\mu \leq \bar{\xi} \leq (1 + \eta)\mu\} \geq 1 - 2 \dim(\mathcal{H}) \exp\left(-\frac{M \eta^2 a}{4 \ln 2}\right). \tag{A.35}$$

Thus it is highly likely that the sample average operator $\bar{\xi}$ becomes close to the true expected 
operator $\mu$ as $M$ becomes large. 

We prove the above lemma in the same way that Ahlswede and Winter did, by making 
a progression through the Operator Markov inequality all the way to the proof of the above 
Operator Chernoff Bound. 

Lemma A.1.2 (Operator Markov Inequality). Let $X$ be a random variable with values in 
the algebra $\mathcal{B}^+(\mathcal{H})$ of positive bounded linear operators on some Hilbert space $\mathcal{H}$. Let $\mathbb{E}\{X\}$ 
denote its expectation. Let $A$ be a fixed positive operator in $\mathcal{B}^+(\mathcal{H})$. Then 

$$\Pr\{X \nleq A\} \leq \text{Tr}\{\mathbb{E}\{X\} A^{-1}\}. \tag{A.36}$$

Proof. Observe that if $X \nleq A$ then $A^{-1/2} X A^{-1/2} \nleq I$. This then implies that the largest 
eigenvalue of $A^{-1/2} X A^{-1/2}$ exceeds one: $\|A^{-1/2} X A^{-1/2}\| > 1$. Let $I_{X \nleq A}$ denote an indicator 
function for the event $X \nleq A$. We then have that 

$$I_{X \nleq A} \leq \text{Tr}\{A^{-1/2} X A^{-1/2}\}. \tag{A.37}$$

The above inequality follows because the RHS is non-negative if the indicator is zero. If 
the indicator is one, then the RHS exceeds one because its largest eigenvalue is greater than 
one and the trace exceeds the largest eigenvalue for a positive operator. We then have the 
following inequalities: 

$$\Pr\{X \nleq A\} = \mathbb{E}_X\{I_{X \nleq A}\} \leq \mathbb{E}\{\text{Tr}\{A^{-1/2} X A^{-1/2}\}\} \tag{A.38}$$

$$= \mathbb{E}\{\text{Tr}\{X A^{-1}\}\} = \text{Tr}\{\mathbb{E}\{X\} A^{-1}\}, \tag{A.39}$$

where the first inequality follows from (A.37) and the second equality from cyclicity of 
trace. □ 
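
A small exact computation illustrates the Operator Markov inequality. The sketch below restricts attention to diagonal operators, which is an assumption made only to keep the arithmetic exact with the standard library (the lemma itself covers non-commuting operators); the distribution, the operator `A`, and the helper name `not_below` are illustrative.

```python
from fractions import Fraction as F

# A finite distribution over diagonal positive operators X = diag(x1, x2),
# stored as (probability, (x1, x2)). The values are illustrative.
outcomes = [
    (F(1, 2), (F(1, 4), F(1, 2))),
    (F(1, 4), (F(3, 2), F(1, 4))),
    (F(1, 4), (F(1, 2), F(3, 1))),
]
A = (F(1), F(2))  # A = diag(1, 2)


def not_below(x, a):
    """True when diag(x) <= diag(a) fails, i.e. a - x has a negative eigenvalue."""
    return any(xi > ai for xi, ai in zip(x, a))


prob_violation = sum(p for p, x in outcomes if not_below(x, A))
mean = [sum(p * x[i] for p, x in outcomes) for i in range(2)]
markov_bound = sum(m / a for m, a in zip(mean, A))  # Tr{E{X} A^{-1}}
print(prob_violation, markov_bound)
```

The probability that the operator inequality fails is bounded by the trace expression, as the lemma states. In this example the bound exceeds one, which is a reminder that the inequality is only informative when the expectation is small relative to `A`.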

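Lemma A.1.3 (Bernstein Trick). Let $X, X_1, \ldots, X_n$ be IID Hermitian random variables 
in $\mathcal{B}(\mathcal{H})$, and let $A$ be a fixed Hermitian operator. Then for any invertible operator $T$, we 
have 

$$\Pr\left\{\sum_{k=1}^{n} X_k \nleq nA\right\} \leq \dim(\mathcal{H}) \left\|\mathbb{E}\{\exp\{T(X - A)T^\dagger\}\}\right\|^n. \tag{A.40}$$

Proof. The proof of this lemma relies on the Golden-Thompson inequality from statistical 
mechanics, which holds for any two Hermitian operators $A$ and $B$ (and which we state without 
proof): 

$$\text{Tr}\{\exp\{A + B\}\} \leq \text{Tr}\{\exp\{A\} \exp\{B\}\}. \tag{A.41}$$

Consider the following chain of inequalities: 

$$\Pr\left\{\sum_{k=1}^{n} X_k \nleq nA\right\} = \Pr\left\{\sum_{k=1}^{n} (X_k - A) \nleq 0\right\} \tag{A.42}$$

$$= \Pr\left\{\sum_{k=1}^{n} T(X_k - A)T^\dagger \nleq 0\right\} \tag{A.43}$$

$$= \Pr\left\{\exp\left\{\sum_{k=1}^{n} T(X_k - A)T^\dagger\right\} \nleq I\right\} \tag{A.44}$$

$$\leq \text{Tr}\left\{\mathbb{E}\left\{\exp\left\{\sum_{k=1}^{n} T(X_k - A)T^\dagger\right\}\right\}\right\} \tag{A.45}$$

The first two equalities are straightforward and the third follows because $A \leq B$ is equivalent 
to $\exp\{A\} \leq \exp\{B\}$ for commuting operators $A$ and $B$. The first inequality follows from 
applying the Operator Markov inequality. Continuing, we have 

$$= \mathbb{E}\left\{\text{Tr}\left\{\exp\left\{\sum_{k=1}^{n} T(X_k - A)T^\dagger\right\}\right\}\right\} \tag{A.46}$$

$$\leq \mathbb{E}\left\{\text{Tr}\left\{\exp\left\{\sum_{k=1}^{n-1} T(X_k - A)T^\dagger\right\} \exp\{T(X_n - A)T^\dagger\}\right\}\right\} \tag{A.47}$$

$$= \mathbb{E}_{X_1, \ldots, X_{n-1}}\left\{\text{Tr}\left\{\exp\left\{\sum_{k=1}^{n-1} T(X_k - A)T^\dagger\right\} \mathbb{E}_{X_n}\{\exp\{T(X_n - A)T^\dagger\}\}\right\}\right\} \tag{A.48}$$

$$= \mathbb{E}_{X_1, \ldots, X_{n-1}}\left\{\text{Tr}\left\{\exp\left\{\sum_{k=1}^{n-1} T(X_k - A)T^\dagger\right\} \mathbb{E}_{X}\{\exp\{T(X - A)T^\dagger\}\}\right\}\right\} \tag{A.49}$$

The first equality follows from exchanging the expectation and the trace. The first inequality 
follows from applying the Golden-Thompson inequality. The second and third equalities 
follow from the IID assumption. Continuing, 

$$\leq \mathbb{E}_{X_1, \ldots, X_{n-1}}\left\{\text{Tr}\left\{\exp\left\{\sum_{k=1}^{n-1} T(X_k - A)T^\dagger\right\}\right\}\right\} \left\|\mathbb{E}_{X}\{\exp\{T(X - A)T^\dagger\}\}\right\| \tag{A.50}$$

$$\leq \text{Tr}\{I\} \left\|\mathbb{E}_{X}\{\exp\{T(X - A)T^\dagger\}\}\right\|^n \tag{A.51}$$

$$= \dim(\mathcal{H}) \left\|\mathbb{E}_{X}\{\exp\{T(X - A)T^\dagger\}\}\right\|^n \tag{A.52}$$

The first inequality follows from $\text{Tr}\{AB\} \leq \text{Tr}\{A\} \|B\|$. The second inequality follows from 
a repeated application of the same steps. The final equality follows because the trace of the 
identity operator is the dimension of the Hilbert space. This proves the "Bernstein trick" 
lemma. □ 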

We finally come closer to proving the Operator Chernoff Bound. We first prove that the 
following inequality holds for IID operators $X, X_1, \ldots, X_n$ such that $\mathbb{E}\{X\} \leq mI$, $A \geq aI$, 
and $1 \geq a > m \geq 0$: 

$$\Pr\left\{\sum_{k=1}^{n} X_k \nleq nA\right\} \leq \dim(\mathcal{H}) \exp\{-n D(a\|m)\}, \tag{A.53}$$

where $D(a\|m)$ is the binary relative entropy: 

$$D(a\|m) = a \log a - a \log m + (1 - a) \log(1 - a) - (1 - a) \log(1 - m), \tag{A.54}$$

where the logarithm is the natural logarithm. We first apply the Bernstein Trick (Lemma A.1.3) 
with $T = \sqrt{t} I$: 

$$\Pr\left\{\sum_{k=1}^{n} X_k \nleq nA\right\} \leq \Pr\left\{\sum_{k=1}^{n} X_k \nleq naI\right\} \tag{A.55}$$

$$\leq \dim(\mathcal{H}) \left\|\mathbb{E}\{\exp\{tX\} \exp\{-ta\}\}\right\|^n. \tag{A.56}$$
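So it is clear that it is best to optimize $t$ in such a way that 

$$\left\|\mathbb{E}\{\exp\{tX\} \exp\{-ta\}\}\right\| < 1 \tag{A.57}$$

so that we have exponential decay with increasing $n$. Now consider the following inequality: 

$$\exp\{tX\} - I \leq X(\exp\{t\} - 1), \tag{A.58}$$

which holds because a similar one holds for all real $x \in (0, 1)$: 

$$\frac{1}{x}(\exp\{tx\} - 1) \leq \exp\{t\} - 1. \tag{A.59}$$

Applying this inequality gives 

$$\mathbb{E}\{\exp\{tX\}\} \leq \mathbb{E}\{X\}(\exp\{t\} - 1) + I \tag{A.60}$$

$$\leq mI(\exp\{t\} - 1) + I \tag{A.61}$$

$$= (m \exp\{t\} + 1 - m) I, \tag{A.62}$$

which in turn implies 

$$\left\|\mathbb{E}\{\exp\{tX\} \exp\{-ta\}\}\right\| \leq (m \exp\{t\} + 1 - m) \exp\{-ta\}. \tag{A.63}$$

Choosing 

$$t = \log\left(\frac{a}{m} \cdot \frac{1 - m}{1 - a}\right) > 0 \tag{A.64}$$

(which follows from the assumption that $a > m$) gives 

$$(m \exp\{t\} + 1 - m) \exp\{-ta\}$$

$$= \left(m \left(\frac{a}{m} \cdot \frac{1 - m}{1 - a}\right) + 1 - m\right) \exp\left\{-\log\left(\frac{a}{m} \cdot \frac{1 - m}{1 - a}\right) a\right\} \tag{A.65}$$

$$= \left(a \cdot \frac{1 - m}{1 - a} + 1 - m\right) \exp\left\{-a \log\left(\frac{a}{m}\right) - a \log\left(\frac{1 - m}{1 - a}\right)\right\} \tag{A.66}$$

$$= \left(\frac{1 - m}{1 - a}\right) \exp\left\{-a \log\left(\frac{a}{m}\right) - a \log\left(\frac{1 - m}{1 - a}\right)\right\} \tag{A.67}$$

$$= \exp\left\{-a \log\left(\frac{a}{m}\right) - (1 - a) \log\left(\frac{1 - a}{1 - m}\right)\right\} \tag{A.68}$$

$$= \exp\{-D(a\|m)\}, \tag{A.69}$$

proving the desired bound in (A.53). 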
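
The choice of $t$ in this step is a scalar optimization, so it is easy to check numerically. The sketch below evaluates the bound $(m e^{t} + 1 - m) e^{-ta}$ at the stated value of $t$, compares it with $\exp\{-D(a\|m)\}$, and confirms on a grid that no other value of $t$ does better. The values of `a` and `m` and the function names are assumptions chosen for illustration.

```python
import math


def relative_entropy(a, m):
    """Binary relative entropy D(a||m) with natural logarithms."""
    return a * math.log(a / m) + (1 - a) * math.log((1 - a) / (1 - m))


def bound(t, a, m):
    """The scalar bound (m e^t + 1 - m) e^{-t a} on the operator norm."""
    return (m * math.exp(t) + 1 - m) * math.exp(-t * a)


a, m = 0.6, 0.3  # any pair with 1 > a > m > 0 works here
t_star = math.log((a / m) * (1 - m) / (1 - a))
grid_min = min(bound(k / 1000, a, m) for k in range(1, 5000))
print(t_star, bound(t_star, a, m), math.exp(-relative_entropy(a, m)), grid_min)
```

The stated $t$ is the stationary point of the bound, which is why substituting it collapses the expression exactly to $\exp\{-D(a\|m)\}$.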



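By substituting $Y_k = I - X_k$ and $B = I - A$ into (A.53) and having the opposite conditions 
$\mathbb{E}\{X\} \geq mI$, $A \leq aI$, and $0 \leq a < m \leq 1$, we can show that the following inequality holds 
for IID operators $X, X_1, \ldots, X_n$: 

$$\Pr\left\{\sum_{k=1}^{n} X_k \ngeq nA\right\} \leq \dim(\mathcal{H}) \exp\{-n D(a\|m)\}. \tag{A.70}$$

To finish off the proof of the Operator Chernoff Bound, consider the variables $Z_i = 
M \mu^{-1/2} X_i \mu^{-1/2}$ with $\mu \equiv \mathbb{E}\{X\} \geq MI$. Then $\mathbb{E}\{Z_i\} = MI$ and $0 \leq Z_i \leq I$. The following 
events are thus equivalent 

$$(1 - \eta)\mu \leq \frac{1}{n} \sum_{i=1}^{n} X_i \leq (1 + \eta)\mu \iff (1 - \eta)MI \leq \frac{1}{n} \sum_{i=1}^{n} Z_i \leq (1 + \eta)MI, \tag{A.71}$$

and we can apply (A.53), (A.70), and the union bound to obtain 

$$\Pr\left\{\neg\left[(1 - \eta)\mu \leq \frac{1}{n} \sum_{i=1}^{n} X_i \leq (1 + \eta)\mu\right]\right\} \tag{A.72}$$

$$\leq \dim(\mathcal{H}) \exp\{-n D((1 - \eta)M\|M)\} + \dim(\mathcal{H}) \exp\{-n D((1 + \eta)M\|M)\} \tag{A.73}$$

$$\leq 2 \dim(\mathcal{H}) \exp\left\{-n \frac{\eta^2 M}{4 \ln 2}\right\}, \tag{A.74}$$

where the last line exploits the following inequality valid for $-1/2 \leq \eta \leq 1/2$ and $(1 + \eta)M \leq 
1$: 

$$D((1 + \eta)M\|M) \geq \frac{1}{4 \ln 2} \eta^2 M. \tag{A.75}$$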
Monotonicity of Quantum Relative Entropy 



The following proof of monotonicity of quantum relative entropy is due to Nielsen and 
Petz [135]. 

Theorem B.0.1. For any two bipartite quantum states $\rho^{XY}$ and $\sigma^{XY}$, the quantum relative 
entropy is monotone under the discarding of systems: 

$$D(\rho^{XY}\|\sigma^{XY}) \geq D(\rho^X\|\sigma^X). \tag{B.1}$$
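
Before turning to the proof, it may help to see the theorem in its simplest instance. When both states are diagonal in the same product basis, the quantum relative entropy reduces to the classical relative entropy of the two joint distributions, and discarding $Y$ is marginalization. The sketch below checks the inequality for one pair of distributions; the numerical values and helper names are assumptions chosen for illustration.

```python
import math

# Joint distributions p(x, y) and q(x, y) on a 2 x 2 alphabet, read as
# diagonal (commuting) states rho^{XY} and sigma^{XY}. Values are illustrative.
p = [[0.4, 0.1], [0.2, 0.3]]
q = [[0.1, 0.2], [0.3, 0.4]]


def relative_entropy(ps, qs):
    """D(p||q) in nats for two probability vectors with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(ps, qs))


def flatten(joint):
    """List the joint probabilities row by row."""
    return [v for row in joint for v in row]


def marginal_x(joint):
    """Discard the Y system by summing over y."""
    return [sum(row) for row in joint]


d_joint = relative_entropy(flatten(p), flatten(q))
d_marginal = relative_entropy(marginal_x(p), marginal_x(q))
print(d_joint, d_marginal)
```

The relative entropy of the marginals is smaller than that of the joint distributions, in line with (B.1). The proof that follows handles the general, non-commuting case.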

Proof. We first require the notion of operator convexity. Recall the partial order $A \geq B$ for 
Hermitian operators $A$ and $B$ where $A \geq B$ if $A - B$ is a positive operator. A function $f$ 
is operator convex if for all $n$, for all $A, B \in M_n$ (where $M_n$ is the set of $n \times n$ Hermitian 
operators), and for all $\lambda \in [0, 1]$, we have 

$$f(\lambda A + (1 - \lambda)B) \leq \lambda f(A) + (1 - \lambda) f(B). \tag{B.2}$$

We require Lemma B.0.4, which states that $-\ln(x)$ is an operator convex function, and 
Lemma B.0.5, which states that if $f$ is an operator convex function and $U^{V \to W}$ is an 
isometry (where $\dim(V) \leq \dim(W)$), then $f(U^\dagger X U) \leq U^\dagger f(X) U$ for all operators $X$. We 
need to reexpress the quantum relative entropy using a linear map on matrices known as 
the relative modular operator. We are assuming that $\rho$ and $\sigma$ are invertible when defining 
this operator, but the proof technique here extends with a continuity argument. Let 
$\mathcal{L}(A) \equiv \sigma A$ and $\mathcal{R}(A) \equiv A \rho^{-1}$. The relative modular operator $\Delta$ is the product of these 
linear maps under composition: $\Delta(A) \equiv \mathcal{L}(\mathcal{R}(A))$. The two superoperators $\mathcal{L}$ and $\mathcal{R}$ 
commute so that $\Delta(A) = \mathcal{R}(\mathcal{L}(A))$. We now define a function $\ln$ on superoperators $\mathcal{E}$, where $\mathcal{E}$ 
is a linear map that is strictly positive with respect to the Hilbert-Schmidt inner product 
$\langle A, B \rangle = \text{Tr}\{A^\dagger B\}$. To define this function, we expand the linear map $\mathcal{E}$ in a basis where it 
is diagonal: $\mathcal{E} = \sum_x \mu_x \mathcal{E}_x$, so that $\ln(\mathcal{E}) \equiv \sum_x \ln(\mu_x) \mathcal{E}_x$. Observe now that $\mathcal{L}(A)$, $\mathcal{R}(A)$, and 
$\Delta(A)$ are all strictly positive for $A \neq 0$ because 

$$\langle A, \mathcal{L}(A) \rangle = \text{Tr}\{A^\dagger \sigma A\} = \text{Tr}\{A A^\dagger \sigma\} > 0, \tag{B.3}$$

$$\langle A, \mathcal{R}(A) \rangle = \text{Tr}\{A^\dagger A \rho^{-1}\} > 0, \tag{B.4}$$

$$\langle A, \Delta(A) \rangle = \text{Tr}\{A^\dagger \sigma A \rho^{-1}\} > 0, \tag{B.5}$$

and our assumption that $\rho$ and $\sigma$ are positive, invertible operators. After expanding the 
maps $\mathcal{L}(A)$ and $\mathcal{R}(A)$ in a diagonal basis, observe that $\ln(\mathcal{L})(A) = \ln(\sigma) A$ and $\ln(\mathcal{R})(A) = 
-A \ln(\rho)$. Also, we have that 

$$\ln(\Delta) = \ln(\mathcal{L}) + \ln(\mathcal{R}) \tag{B.6}$$

because $\mathcal{L}$ and $\mathcal{R}$ commute. We can then rewrite the quantum relative entropy as follows: 

$$D(\rho\|\sigma) = \text{Tr}\{\rho \ln \rho - \rho \ln \sigma\} \tag{B.7}$$

$$= \text{Tr}\{\rho \ln \rho\} - \text{Tr}\{\rho \ln \sigma\} \tag{B.8}$$

$$= \text{Tr}\{\rho^{1/2} \rho^{1/2} \ln \rho\} - \text{Tr}\{\rho^{1/2} \rho^{1/2} \ln \sigma\} \tag{B.9}$$

$$= -\text{Tr}\{\rho^{1/2} \ln(\mathcal{R})(\rho^{1/2})\} - \text{Tr}\{\ln(\mathcal{L})(\rho^{1/2}) \rho^{1/2}\} \tag{B.10}$$

$$= \text{Tr}\{\rho^{1/2} [-\ln(\mathcal{R})(\rho^{1/2}) - \ln(\mathcal{L})(\rho^{1/2})]\} \tag{B.11}$$

$$= \langle \rho^{1/2}, -\ln(\Delta)(\rho^{1/2}) \rangle. \tag{B.12}$$

The statement of monotonicity of quantum relative entropy then becomes 

$$\langle (\rho^X)^{1/2}, -\ln(\Delta^X)((\rho^X)^{1/2}) \rangle \leq \langle (\rho^{XY})^{1/2}, -\ln(\Delta^{XY})((\rho^{XY})^{1/2}) \rangle, \tag{B.13}$$

where 

$$\Delta^X(A) \equiv \sigma^X A (\rho^X)^{-1}, \tag{B.14}$$

$$\Delta^{XY}(A) \equiv \sigma^{XY} A (\rho^{XY})^{-1}. \tag{B.15}$$

To complete the proof, suppose that an isometry $U$ from $X$ to $XY$ has the following 
properties: 

$$U^\dagger \Delta^{XY} U = \Delta^X, \tag{B.16}$$

$$U((\rho^X)^{1/2}) = (\rho^{XY})^{1/2}. \tag{B.17}$$

We investigate the consequences of the existence of such an isometry and later explicitly 
construct it. First, we rewrite monotonicity of quantum relative entropy in terms of this 
isometry: 

$$\langle (\rho^X)^{1/2}, -\ln(U^\dagger \Delta^{XY} U)((\rho^X)^{1/2}) \rangle \leq \langle (\rho^{XY})^{1/2}, -\ln(\Delta^{XY})((\rho^{XY})^{1/2}) \rangle. \tag{B.18}$$

Now we know from Lemmas B.0.4 and B.0.5 on operator convexity that 

$$-\ln(U^\dagger \Delta^{XY} U) \leq -U^\dagger \ln(\Delta^{XY}) U, \tag{B.19}$$
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so that 



(p x ) 1/2 ,-ln(U^A XY U)((p x ) 1/2 
<(( P X ) 1/2 ,-Uhn(A XY )u((p x ) 1/2 

u((p x ) 1/2 ),-m(A XY )u(( P x ) 

(p XY ) 1/2 ,-ln(A XY )((p XY ) 1/2 )). 



(B.20) 
x ^\ (B.21) 

(B.22) 

This completes the proof of monotonicity of quantum relative entropy if it is true that there 



exists an isometry satisfying the properties in (B.16 B.17). Consider the following choice for 
the map U: 

U(A) = (A{p x )- l/2 ®I Y ){p XY ) l/2 . (B.23) 



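This choice for $U$ satisfies (B.17) by inspection. The adjoint $U^\dagger$ is some operator satisfying

\[ \langle B, U(A) \rangle = \langle U^\dagger(B), A \rangle \tag{B.24} \]

for all $A \in \mathcal{H}^X$ and $B \in \mathcal{H}^{XY}$. Thus, we require some operator $U^\dagger$ such that

\begin{align}
\mathrm{Tr}\{[U^\dagger(B)]^\dagger A\} &= \langle U^\dagger(B), A \rangle \tag{B.25} \\
&= \langle B, U(A) \rangle \tag{B.26} \\
&= \mathrm{Tr}\{B^\dagger (A(\rho^X)^{-1/2} \otimes I^Y)(\rho^{XY})^{1/2}\} \tag{B.27} \\
&= \mathrm{Tr}\{((\rho^X)^{-1/2} \otimes I^Y)(\rho^{XY})^{1/2} B^\dagger (A \otimes I^Y)\}. \tag{B.28}
\end{align}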



So the adjoint $U^\dagger$ is as follows:

\[ U^\dagger(B) = \mathrm{Tr}_Y\{B (\rho^{XY})^{1/2} ((\rho^X)^{-1/2} \otimes I^Y)\}. \tag{B.29} \]



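We can now verify that (B.16) holds:

\begin{align}
U^\dagger \Delta^{XY} U(A)
&= \mathrm{Tr}_Y\{\sigma^{XY} [(A(\rho^X)^{-1/2} \otimes I^Y)(\rho^{XY})^{1/2}] (\rho^{XY})^{-1} (\rho^{XY})^{1/2} ((\rho^X)^{-1/2} \otimes I^Y)\} \tag{B.30} \\
&= \mathrm{Tr}_Y\{\sigma^{XY} (A(\rho^X)^{-1/2} \otimes I^Y)((\rho^X)^{-1/2} \otimes I^Y)\} \tag{B.31} \\
&= \mathrm{Tr}_Y\{\sigma^{XY} (A(\rho^X)^{-1} \otimes I^Y)\} \tag{B.32} \\
&= \sigma^X A (\rho^X)^{-1} \tag{B.33} \\
&= \Delta^X(A). \tag{B.34}
\end{align}

The map $U$ is also an isometry because

\begin{align}
U^\dagger(U(A)) &= \mathrm{Tr}_Y\{(A(\rho^X)^{-1/2} \otimes I^Y)(\rho^{XY})^{1/2}(\rho^{XY})^{1/2}((\rho^X)^{-1/2} \otimes I^Y)\} \tag{B.35} \\
&= \mathrm{Tr}_Y\{(A(\rho^X)^{-1/2} \otimes I^Y)\rho^{XY}((\rho^X)^{-1/2} \otimes I^Y)\} \tag{B.36} \\
&= A(\rho^X)^{-1/2} \rho^X (\rho^X)^{-1/2} \tag{B.37} \\
&= A. \tag{B.38}
\end{align}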

This concludes the proof of the monotonicity of quantum relative entropy. Observe that
the main tool exploited in proving this theorem is the operator convexity of the function
$-\ln(x)$. □
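The properties of the map $U$ used in the proof can be checked numerically on a small example. The sketch below assumes NumPy, random full-rank states on a $2 \times 3$ dimensional system, and the ordering convention of `np.kron` for $X \otimes Y$; all function names (`ptrace_y`, `u`, `u_dag`, `delta_x`, `delta_xy`, `rel_entropy`) are hypothetical names chosen for this illustration.

```python
import numpy as np

def herm_fn(m, fn):
    """Apply a scalar function to a Hermitian matrix through its eigenvalues."""
    w, v = np.linalg.eigh(m)
    return (v * fn(w)) @ v.conj().T

def random_density(d, rng):
    """Random full-rank density matrix (hypothetical helper for this sketch)."""
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    m = g @ g.conj().T + 0.1 * np.eye(d)
    return m / np.trace(m).real

def ptrace_y(m, dx, dy):
    """Partial trace over Y of an operator on X (x) Y (np.kron ordering)."""
    return np.einsum('iaja->ij', m.reshape(dx, dy, dx, dy))

def rel_entropy(rho, sigma):
    """D(rho || sigma) = Tr{rho ln rho - rho ln sigma}."""
    return np.trace(rho @ (herm_fn(rho, np.log) - herm_fn(sigma, np.log))).real

dx, dy = 2, 3
rng = np.random.default_rng(1)
rho_xy, sigma_xy = random_density(dx * dy, rng), random_density(dx * dy, rng)
rho_x, sigma_x = ptrace_y(rho_xy, dx, dy), ptrace_y(sigma_xy, dx, dy)
i_y = np.eye(dy)
inv_sqrt_rho_x = np.kron(herm_fn(rho_x, lambda w: w ** -0.5), i_y)
sqrt_rho_xy = herm_fn(rho_xy, np.sqrt)

def u(a):
    """The map in (B.23)."""
    return np.kron(a, i_y) @ inv_sqrt_rho_x @ sqrt_rho_xy

def u_dag(b):
    """The adjoint in (B.29)."""
    return ptrace_y(b @ sqrt_rho_xy @ inv_sqrt_rho_x, dx, dy)

def delta_xy(a):
    return sigma_xy @ a @ np.linalg.inv(rho_xy)

def delta_x(a):
    return sigma_x @ a @ np.linalg.inv(rho_x)

a = rng.normal(size=(dx, dx)) + 1j * rng.normal(size=(dx, dx))
b = rng.normal(size=(dx * dy, dx * dy)) + 1j * rng.normal(size=(dx * dy, dx * dy))

ok_b17 = np.allclose(u(herm_fn(rho_x, np.sqrt)), sqrt_rho_xy)        # (B.17)
ok_b16 = np.allclose(u_dag(delta_xy(u(a))), delta_x(a))              # (B.16)
ok_isometry = np.allclose(u_dag(u(a)), a)                            # (B.38)
ok_adjoint = np.isclose(np.trace(b.conj().T @ u(a)),
                        np.trace(u_dag(b).conj().T @ a))             # (B.24)
ok_monotone = rel_entropy(rho_x, sigma_x) <= rel_entropy(rho_xy, sigma_xy) + 1e-12
print(ok_b17, ok_b16, ok_isometry, ok_adjoint, ok_monotone)
```

A single random instance is of course not a proof; the sketch only illustrates that the algebra in (B.23) through (B.38) is consistent.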

Lemma B.0.4. $-\ln(x)$ is an operator convex function.

Proof. We begin by proving that $x^{-1}$ is operator convex on $(0, \infty)$. Let $A$ and $B$ be strictly
positive operators such that $A \leq B$. Begin with the special case where $A = I$; the goal
then becomes to prove that

\[ (\lambda I + (1 - \lambda)B)^{-1} \leq \lambda I + (1 - \lambda)B^{-1}. \tag{B.39} \]

The above result follows because $I$ and $B$ commute and the function $x^{-1}$ is convex on the
positive real numbers. Now make the substitution $B \to A^{-1/2} B A^{-1/2}$ in the above to obtain

\[ (\lambda I + (1 - \lambda)A^{-1/2} B A^{-1/2})^{-1} \leq \lambda I + (1 - \lambda)(A^{-1/2} B A^{-1/2})^{-1}. \tag{B.40} \]

Conjugating by $A^{-1/2}$ gives the desired inequality:

\begin{align}
A^{-1/2}(\lambda I + (1 - \lambda)A^{-1/2} B A^{-1/2})^{-1} A^{-1/2}
&\leq A^{-1/2}(\lambda I + (1 - \lambda)(A^{-1/2} B A^{-1/2})^{-1}) A^{-1/2}, \tag{B.41} \\
\therefore\ (\lambda A + (1 - \lambda)B)^{-1} &\leq \lambda A^{-1} + (1 - \lambda)B^{-1}. \tag{B.42}
\end{align}

We can now prove the operator convexity of $-\ln(A)$ by exploiting the above result and the
following integral representation of $-\ln(a)$:

\[ -\ln(a) = \int_0^\infty dt \left[ \frac{1}{a + t} - \frac{1}{1 + t} \right]. \tag{B.43} \]
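As a quick check of this representation for a scalar $a > 0$ (a short worked step added for illustration), one can truncate the integral at a finite upper limit $T$ and then take the limit:

```latex
\begin{align*}
\int_0^T dt \left[ \frac{1}{a+t} - \frac{1}{1+t} \right]
&= \left[ \ln(a+T) - \ln(a) \right] - \left[ \ln(1+T) - \ln(1) \right] \\
&= \ln\!\left( \frac{a+T}{1+T} \right) - \ln(a)
\ \xrightarrow{\,T \to \infty\,}\ -\ln(a).
\end{align*}
```

The ratio $(a+T)/(1+T)$ tends to one, so only the $-\ln(a)$ term survives.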

The following integral representation then follows for a strictly positive operator $A$:

\[ -\ln(A) = \int_0^\infty dt \left[ (A + tI)^{-1} - (I + tI)^{-1} \right]. \tag{B.44} \]

Operator convexity of $-\ln(A)$ follows if $(A + tI)^{-1}$ is operator convex:

\[ (\lambda A + (1 - \lambda)B + tI)^{-1} \leq \lambda(A + tI)^{-1} + (1 - \lambda)(B + tI)^{-1}. \tag{B.45} \]

The above statement follows by rewriting the LHS as

\[ (\lambda(A + tI) + (1 - \lambda)(B + tI))^{-1} \tag{B.46} \]

and applying operator convexity of $x^{-1}$. □
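The two operator inequalities in this lemma, (B.42) for $x^{-1}$ and the analogous statement for $-\ln(x)$, can be spot-checked numerically. The sketch below assumes NumPy and random strictly positive matrices; `random_positive`, `neg_ln`, and `min_eig` are hypothetical helper names. An operator inequality $L \leq R$ holds exactly when the smallest eigenvalue of $R - L$ is non-negative, which is what the code measures.

```python
import numpy as np

def herm_fn(m, fn):
    """Apply a scalar function to a Hermitian matrix through its eigenvalues."""
    w, v = np.linalg.eigh(m)
    return (v * fn(w)) @ v.conj().T

def random_positive(d, rng):
    """Random strictly positive operator (hypothetical helper for this sketch)."""
    g = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    return g @ g.conj().T + 0.1 * np.eye(d)

def neg_ln(m):
    return -herm_fn(m, np.log)

def min_eig(m):
    """Smallest eigenvalue of the Hermitian part of m."""
    return np.linalg.eigvalsh((m + m.conj().T) / 2).min()

rng = np.random.default_rng(2)
gaps_inv, gaps_log = [], []
for _ in range(200):
    a, b = random_positive(4, rng), random_positive(4, rng)
    lam = rng.uniform()
    mix = lam * a + (1 - lam) * b
    # (B.42): (lam A + (1 - lam) B)^{-1} <= lam A^{-1} + (1 - lam) B^{-1}
    gaps_inv.append(min_eig(lam * np.linalg.inv(a) + (1 - lam) * np.linalg.inv(b)
                            - np.linalg.inv(mix)))
    # Same form of inequality with f(x) = -ln(x)
    gaps_log.append(min_eig(lam * neg_ln(a) + (1 - lam) * neg_ln(b) - neg_ln(mix)))

print(min(gaps_inv), min(gaps_log))
```

Both printed minima should be non-negative up to floating-point error.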

Lemma B.0.5. If $f$ is an operator convex function and $U^{V \to W}$ is an isometry (where
$\dim(V) \leq \dim(W)$), then $f(U^\dagger X U) \leq U^\dagger f(X) U$ for all operators $X$.

Proof. First recall that $f(U^\dagger X U) = U^\dagger f(X) U$ if $U$ maps $V$ onto $W$. This follows from the
way that we apply functions to operators. The more difficult part of the proof is proving
the inequality if $U$ does not map $V$ onto $W$. Let $\Pi$ be a projector onto the range $W'$ of $U$
($W'$ is the subspace of $W$ into which the isometry takes vectors in $V$), and let $\hat{\Pi} \equiv I - \Pi$ be
a projector onto the orthocomplement. It is useful to adopt the notation $f_V$, $f_W$, and $f_{W'}$
to denote the three different spaces on which the function $f$ can act. Observe that $\Pi U = U$
because $\Pi$ projects onto the range of $U$. It follows that

\[ f_V(U^\dagger X U) = f_V(U^\dagger \Pi (\Pi X \Pi) \Pi U), \tag{B.47} \]

and we can then conclude that

\[ f_V(U^\dagger \Pi (\Pi X \Pi) \Pi U) = U^\dagger \Pi f_{W'}(\Pi X \Pi) \Pi U, \tag{B.48} \]

from our observation at the beginning of the proof. It then suffices to prove the following
inequality in order to prove the inequality in the statement of the lemma:

\[ f_{W'}(\Pi X \Pi) \leq \Pi f_W(X) \Pi. \tag{B.49} \]

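Consider that

\[ f_{W'}(\Pi X \Pi) = \Pi f_W(\Pi X \Pi) \Pi = \Pi f_W(\Pi X \Pi + \hat{\Pi} X \hat{\Pi}) \Pi, \tag{B.50} \]

because

\begin{align}
f_W(\Pi X \Pi + \hat{\Pi} X \hat{\Pi}) &= f_W(\Pi X \Pi) + f_W(\hat{\Pi} X \hat{\Pi}), \tag{B.51} \\
\Pi f_W(\hat{\Pi} X \hat{\Pi}) \Pi &= 0. \tag{B.52}
\end{align}

Let $S \equiv \Pi - \hat{\Pi}$, which is a unitary on $W$. Recalling that $\Pi + \hat{\Pi} = I$, it follows that

\begin{align}
\frac{X + S X S^\dagger}{2} &= \frac{(\Pi + \hat{\Pi}) X (\Pi + \hat{\Pi}) + (\Pi - \hat{\Pi}) X (\Pi - \hat{\Pi})}{2} \tag{B.53} \\
&= \Pi X \Pi + \hat{\Pi} X \hat{\Pi}. \tag{B.54}
\end{align}

We can then apply the above equality and operator convexity of $f$ to give

\begin{align}
f_W(\Pi X \Pi + \hat{\Pi} X \hat{\Pi}) &= f_W\!\left( \frac{X + S X S^\dagger}{2} \right) \tag{B.55} \\
&\leq \frac{1}{2} \left[ f_W(X) + f_W(S X S^\dagger) \right] \tag{B.56} \\
&= \frac{1}{2} \left[ f_W(X) + S f_W(X) S^\dagger \right] \tag{B.57} \\
&= \Pi f_W(X) \Pi + \hat{\Pi} f_W(X) \hat{\Pi}. \tag{B.58}
\end{align}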



Conjugating by $\Pi$ and recalling (B.50) gives the desired inequality that is sufficient to prove
the one in the statement of the lemma:

\[ \Pi f_W(\Pi X \Pi + \hat{\Pi} X \hat{\Pi}) \Pi \leq \Pi f_W(X) \Pi. \tag{B.59} \]
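The statement of the lemma and the averaging identity in (B.53) and (B.54) can be illustrated numerically. The sketch below assumes NumPy, takes $f(x) = -\ln(x)$ (operator convex by Lemma B.0.4), represents the isometry as a $5 \times 2$ matrix with orthonormal columns obtained from a QR decomposition, and uses a random strictly positive $X$; the variable names are hypothetical and chosen only for this example.

```python
import numpy as np

def herm_fn(m, fn):
    """Apply a scalar function to a Hermitian matrix through its eigenvalues."""
    w, v = np.linalg.eigh(m)
    return (v * fn(w)) @ v.conj().T

def neg_ln(m):
    return -herm_fn(m, np.log)

rng = np.random.default_rng(3)
dv, dw = 2, 5

# Random isometry U: V -> W, stored as a dw x dv matrix with U^dagger U = I_V.
g = rng.normal(size=(dw, dv)) + 1j * rng.normal(size=(dw, dv))
u, _ = np.linalg.qr(g)

# Random strictly positive operator X on W.
h = rng.normal(size=(dw, dw)) + 1j * rng.normal(size=(dw, dw))
x = h @ h.conj().T + 0.1 * np.eye(dw)

proj = u @ u.conj().T          # projector Pi onto the range of U
proj_hat = np.eye(dw) - proj   # projector onto the orthocomplement
s = proj - proj_hat            # the unitary S = Pi - Pi_hat

# (B.53)-(B.54): averaging X with S X S^dagger removes the off-diagonal blocks.
pinched = proj @ x @ proj + proj_hat @ x @ proj_hat
averaged = (x + s @ x @ s.conj().T) / 2

# Lemma statement: f(U^dagger X U) <= U^dagger f(X) U.
gap = u.conj().T @ neg_ln(x) @ u - neg_ln(u.conj().T @ x @ u)
min_gap = np.linalg.eigvalsh((gap + gap.conj().T) / 2).min()

print(np.allclose(pinched, averaged), min_gap)
```

The smallest eigenvalue `min_gap` should be non-negative up to floating-point error, in agreement with the lemma.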